Python stemming (with pandas dataframe): answers

【Question title】: Python stemming (with pandas dataframe)
【Posted】: 2016-09-23 09:53:32
【Question】:

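I created a dataframe containing sentences to be stemmed. I would like to use a SnowballStemmer to get higher accuracy from my classification algorithm. How can I do this?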

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Use English stemmer.
stemmer = SnowballStemmer("english")

# Sentences to be stemmed.
data = ["programmers program with programming languages", "my code is working so there must be a bug in the interpreter"]

# Create the pandas DataFrame.
df = pd.DataFrame(data, columns=['unstemmed'])

# Split the sentences to lists of words.
df['unstemmed'] = df['unstemmed'].str.split()

# Make sure we see the full column (None means no width limit; -1 is rejected by recent pandas versions).
pd.set_option('display.max_colwidth', None)

# Print dataframe.
df 

+----+---------------------------------------------------------------+
|    | unstemmed                                                     |
|----+---------------------------------------------------------------|
|  0 | ['programmers', 'program', 'with', 'programming', 'languages']|
|  1 | ['my', 'code', 'is', 'working', 'so', 'there', 'must',        |  
|    |  'be', 'a', 'bug', 'in', 'the', 'interpreter']                |
+----+---------------------------------------------------------------+

【Comments】:

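What is the type of this column? A string (= a sentence) or an array of strings (= words)? Do not apply the stemmer to a whole sentence; apply it one word at a time.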
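To make the comment above concrete, here is a minimal sketch of how one might check what the column holds and compare stemming a whole sentence with stemming word by word. The single-row frame and the variable names (cell_types, stemmed_as_sentence, stemmed_per_word) are illustrative assumptions, not part of the original question.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical one-row frame that still holds the raw sentence (not yet split).
df = pd.DataFrame(
    {"unstemmed": ["programmers program with programming languages"]}
)

# Inspect what the cells actually contain: str (a sentence) or list (words).
cell_types = df["unstemmed"].map(type).unique()

sentence = df["unstemmed"].iloc[0]

# Passing the whole sentence treats it as one long "word", so only the
# end of the string is considered for suffix stripping.
stemmed_as_sentence = stemmer.stem(sentence)

# Stemming word by word is what the comment recommends.
stemmed_per_word = [stemmer.stem(word) for word in sentence.split()]

print(cell_types)
print(stemmed_as_sentence)
print(stemmed_per_word)
```

If the cells turn out to be strings, splitting them first (as the question does with str.split()) gives the stemmer individual words to work on.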

Tags: python pandas nlp stemming

【Solution 1】:

You have to apply the stemmer to each word and store the result in a "stemmed" column.

df['stemmed'] = df['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.
df = df.drop(columns=['unstemmed']) # Get rid of the unstemmed column.
df # Print dataframe.

+----+--------------------------------------------------------------+
|    | stemmed                                                      |
|----+--------------------------------------------------------------|
|  0 | ['program', 'program', 'with', 'program', 'languag']         |
|  1 | ['my', 'code', 'is', 'work', 'so', 'there', 'must',          |   
|    |  'be', 'a', 'bug', 'in', 'the', 'interpret']                 |
+----+--------------------------------------------------------------+
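Since the question mentions feeding the result to a classification algorithm, it may be convenient to turn the stemmed word lists back into plain strings, because many text vectorizers expect one string per document. The following is a sketch under that assumption; the stemmed_text column name is hypothetical and not part of the original answer.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

df = pd.DataFrame({"unstemmed": [
    "programmers program with programming languages",
    "my code is working so there must be a bug in the interpreter",
]})

# Split each sentence into words and stem every word, as in the answer.
df["stemmed"] = df["unstemmed"].str.split().apply(
    lambda words: [stemmer.stem(word) for word in words]
)

# Join each list of stems back into a single space-separated string.
# "stemmed_text" is an illustrative column name.
df["stemmed_text"] = df["stemmed"].str.join(" ")

print(df["stemmed_text"].tolist())
```

Keeping both columns lets you inspect the individual stems while passing the joined text to whatever feature extraction step comes next.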

【Discussion】:

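Sorry if I'm being a bit dense; I'm still fairly new to Python and Stack Overflow.
OK. That's because you are doing it inside a for loop. Remove for w in data[["stemmed"]]: and it should work.
The apply method is meant to apply a function to all rows/columns of a dataframe, so you don't have to iterate over the rows/columns yourself. For more information, you can check the documentation: link
After removing the first for loop, I still get the same kind of error: imgur.com/AUaaqmM
Can you show me the dataframe before you run apply?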
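The error itself is not shown in the thread, so the following is only a sketch of one plausible source of confusion: iterating over a DataFrame yields its column labels, not its rows, whereas apply already visits every row. The frame named data and the helper stem_tokens are hypothetical stand-ins for the commenter's code.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical frame whose cells are already lists of words.
data = pd.DataFrame(
    {"unstemmed": [["my", "code", "is", "working"], ["a", "bug"]]}
)

# Looping over a DataFrame gives column labels, not rows or cells.
labels = [w for w in data[["unstemmed"]]]


def stem_tokens(tokens):
    """Stem every word in one row's list of tokens."""
    return [stemmer.stem(token) for token in tokens]


# apply calls stem_tokens once per row, so no outer for loop is needed.
data["stemmed"] = data["unstemmed"].apply(stem_tokens)

print(labels)
print(data["stemmed"].tolist())
```

Printing the frame (or the cell types) before calling apply, as the last comment asks, is a quick way to confirm that each cell really is a list of words.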