[Posted]: 2020-01-26 19:22:59
[Question]:
I'm trying to lemmatize the words in a specific column ('body') using pandas.
I tried the following code, which I found here:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()
df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in
df['body'].head()
When I try to run the code, I get a simple error message:
File "<ipython-input-41-c002479904b0>", line 33
df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x)
^
SyntaxError: invalid syntax
I also tried the solution suggested in this post, but without any luck.
Update: here is the full code so far:
import pandas as pd
import re
import string
df1 = pd.read_csv('RP_text_posts.csv')
df2 = pd.read_csv('RP_text_comments.csv')
# Renaming columns so the post content column - currently 'selftext' - matches the 'body' column in the comments
df1.columns = ['author','subreddit','score','num_comments','retrieved_on','id','created_utc','body']
# Dropping columns that aren't subreddit or the post content
df1 = df1.drop(columns=['author','score','num_comments','retrieved_on','id','created_utc'])
df2 = df2.drop(columns=['author', 'score', 'created_utc'])
# Combining data
df = pd.concat([df1, df2])
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()
stop = stopwords.words('english')
# Lemmatizing
df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x)
df['body'].head()
[Comments]:
-
Always share the entire error message.
-
Sorry, the full error message is:
File "<ipython-input-41-c002479904b0>", line 33 df['words'] = df['words'].apply(lambda x: "".join([Word(word).lemmatize() for word in x) ^ SyntaxError: invalid syntax -
What code comes before that? Is what you shared everything? It doesn't look like it should throw a syntax error to me.
-
I've now added the full code and corrected the column names. I think I may need to change an option after
lambda x:, but I'm not sure, and I had no luck when I tested it by making my column header match the one in the example I was following, which labels it 'words'.
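-
For reference, the SyntaxError comes from the unclosed brackets in the `apply` line: the list comprehension's `]` and the two closing `)` are missing. Separately, `Word` is a TextBlob class that is never imported; the `lemmatizer` created earlier in the code can be used instead, and joining with `" "` rather than `""` keeps words separated. A minimal sketch of the corrected line, using an identity stand-in for the lemmatizer so it runs without the NLTK corpus downloads (substitute `lemmatizer.lemmatize` and `word_tokenize` in the real code):

```python
import pandas as pd

def lemmatize(word):
    # Stand-in for WordNetLemmatizer().lemmatize so this sketch runs
    # without downloading the WordNet corpus; it returns the word as-is.
    return word

df = pd.DataFrame({'body': ['The cats are running']})

# Balanced brackets: the comprehension's ] and both closing ) are present,
# and the words are joined with a space instead of an empty string.
df['body'] = df['body'].apply(
    lambda x: " ".join([lemmatize(word) for word in x.split()]))
print(df['body'].head())
```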
[Tags]: python pandas nltk lemmatization