【Question title】: Syntax error when lemmatizing column in pandas
【Posted】: 2020-01-26 19:22:59
【Question description】:

I am trying to lemmatize the words in a specific column ("body") using pandas.

I tried the following code, which I found here

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer 
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()

df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x)
df['body'].head()

When I try to run the code, I get a simple error message

File "<ipython-input-41-c002479904b0>", line 33
  df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x)
   ^
SyntaxError: invalid syntax

I also tried the solution proposed in this post, but without any luck.

Update: this is the complete code so far

import pandas as pd
import re
import string


df1 = pd.read_csv('RP_text_posts.csv')
df2 = pd.read_csv('RP_text_comments.csv')
# Renaming columns so the post part - currently 'selftext' matches the post variable in the comments - 'body'
df1.columns = ['author','subreddit','score','num_comments','retrieved_on','id','created_utc','body']
# Dropping columns that aren't subreddit or the post content
df1 = df1.drop(columns=['author','score','num_comments','retrieved_on','id','created_utc'])
df2 = df2.drop(labels=None, columns=['author', 'score', 'created_utc'])
# Combining data
df = pd.concat([df1, df2])

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()
stop = stopwords.words('english')

# Lemmatizing
df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x) 
df['body'].head()

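【Comments】:

  • Always share the entire error message.
  • Sorry, the full error message is: File "<ipython-input-41-c002479904b0>", line 33 df['words'] = df['words'].apply(lambda x: "".join([Word(word).lemmatize() for word in x) ^ SyntaxError: invalid syntax
  • What code comes before that? Is what you shared everything there is? To me it doesn't look like it should throw a syntax error.
  • I have now added the complete code so far and corrected the column name. I think I may need to change something after lambda x:, but I'm not sure. I also had no luck when I tested it with my column header renamed to "words" to match the example I was using.

Tags: python pandas nltk lemmatization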


【Solution 1】:

The list comprehension inside the lambda is missing its closing square bracket before the final parentheses:

df['words'] = df['words'].apply(lambda x: "".join([Word(word).lemmatize() for word in x])) 
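
The bracket mismatch can be confirmed without pandas or NLTK installed. The sketch below is only an illustration, not part of the original answer: it passes both versions of the line to Python's ast.parse, which checks syntax without executing anything, so names such as df and Word do not need to exist. The helper name is_valid_syntax is hypothetical.

```python
import ast


def is_valid_syntax(source: str) -> bool:
    """Return True if `source` parses as Python code (nothing is executed)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Line from the question: the list opened by "[" is never closed with "]".
broken = """df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x)"""

# Corrected line: "]" closes the list comprehension, then "))" closes join( and apply(.
fixed = """df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x]))"""

print(is_valid_syntax(broken))  # False
print(is_valid_syntax(fixed))   # True
```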

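Update: Word is never imported in the question's code (it looks like TextBlob's Word class), so with NLTK the line should look more like the one below. Note that this lemmatizes every word with a single POS only (noun by default; it could instead be adjective, or verb, or ...):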

df['words'] = df['body'].apply(lambda x: " ".join([wordnet_lemmatizer.lemmatize(word) for word in word_tokenize(x)]))
print(df.head())
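
Two details change between the question's line and the one above: the text is tokenized before lemmatizing, and the tokens are joined with a space instead of an empty string. The sketch below illustrates why both matter. It is a simplified stand-in, not the NLTK code itself: TOY_LEMMAS and toy_lemmatize are hypothetical and replace the real lemmatizer, and str.split replaces word_tokenize, so that the example runs without any NLTK data.

```python
# Toy stand-in for a real lemmatizer (hypothetical lookup table, illustration only).
TOY_LEMMAS = {"cats": "cat", "scores": "score", "rocks": "rock"}


def toy_lemmatize(token: str) -> str:
    """Return the toy lemma for `token`, or the token unchanged if unknown."""
    return TOY_LEMMAS.get(token, token)


text = "good cats"

# Iterating over a string yields single characters, so no word is ever
# lemmatized and "".join simply rebuilds the original text.
per_char = "".join([toy_lemmatize(ch) for ch in text])

# Splitting into tokens first lets the lemmatizer see whole words;
# " ".join puts a space back between them.
per_word = " ".join([toy_lemmatize(tok) for tok in text.split()])

print(per_char)  # good cats
print(per_word)  # good cat
```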

If you want to lemmatize each word according to its own part of speech, you can try the following code:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()
stop = stopwords.words('english')


def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)



# Lemmatizing
df['words'] = df['body'].apply(lemmatize_sentence)
print(df.head())

Resulting df:

                               body |                           words
0  Best scores, good cats, it rocks | Best score , good cat , it rock
1          You received best scores |          You receive best score
2                         Good news |                       Good news
3                          Bad news |                        Bad news
4                    I am loving it |                    I be love it
5                    it rocks a lot |                   it rock a lot
6     it is still good to do better |     it be still good to do good
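
One point the thread does not cover: pd.read_csv loads empty cells as NaN (a float), and nltk.word_tokenize expects a string, so a missing post body could make apply fail even once the syntax is fixed. Whether that actually happens with this data is an assumption. A minimal guard might look like the sketch below; lemmatize_or_keep is a hypothetical helper, and str.upper only stands in for lemmatize_sentence so that the sketch runs without NLTK.

```python
from typing import Any, Callable


def lemmatize_or_keep(value: Any, lemmatize: Callable[[str], str]) -> Any:
    """Apply `lemmatize` to strings; return any other value (e.g. NaN) unchanged."""
    if isinstance(value, str):
        return lemmatize(value)
    return value


# Stand-in data: the middle entry mimics an empty CSV cell read as NaN.
rows = ["it rocks a lot", float("nan"), "Good news"]

# str.upper is a placeholder transformation (assumption), not a lemmatizer.
cleaned = [lemmatize_or_keep(row, str.upper) for row in rows]
print(cleaned)
```

With the DataFrame from the question, the same guard could be used as df['words'] = df['body'].apply(lambda x: lemmatize_or_keep(x, lemmatize_sentence)).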

【Comments】:

  • Sorry, that was a mistake I made when copying the code. Even with the correction, it doesn't work.