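【Posted】: 2016-09-23 09:53:32
【Question】:
I created a DataFrame containing sentences to be stemmed. I would like to use a SnowballStemmer to get higher accuracy from my classification algorithm. How can I do this?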
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
# Use English stemmer.
stemmer = SnowballStemmer("english")
# Sentences to be stemmed.
data = ["programmers program with programming languages", "my code is working so there must be a bug in the interpreter"]
# Create the pandas DataFrame.
df = pd.DataFrame(data, columns = ['unstemmed'])
# Split the sentences to lists of words.
df['unstemmed'] = df['unstemmed'].str.split()
# Make sure we see the full column.
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in newer pandas
# Print dataframe.
df
+----+----------------------------------------------------------------+
|    | unstemmed                                                      |
|----+----------------------------------------------------------------|
|  0 | ['programmers', 'program', 'with', 'programming', 'languages'] |
|  1 | ['my', 'code', 'is', 'working', 'so', 'there', 'must',         |
|    |  'be', 'a', 'bug', 'in', 'the', 'interpreter']                 |
+----+----------------------------------------------------------------+
【Comments】:
-
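What is the type of this column? A string (= a sentence) or an array of strings (= words)? Don't apply the stemmer to a whole sentence; apply it one word at a time.
Tags: python pandas nlp stemming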
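One way to follow that advice, as a minimal sketch: since each cell of `unstemmed` holds a list of words after the split, you can use `Series.apply` with a list comprehension that calls `stemmer.stem` on every word. The `stemmed` column name is just an illustrative choice, not something from the question.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# English Snowball stemmer, as in the question.
stemmer = SnowballStemmer("english")

data = [
    "programmers program with programming languages",
    "my code is working so there must be a bug in the interpreter",
]
df = pd.DataFrame(data, columns=["unstemmed"])

# Each cell becomes a list of words.
df["unstemmed"] = df["unstemmed"].str.split()

# Stem one word at a time; "stemmed" is a hypothetical column name.
df["stemmed"] = df["unstemmed"].apply(
    lambda words: [stemmer.stem(word) for word in words]
)

print(df["stemmed"].tolist())
```

If the classifier expects plain strings rather than lists, the stemmed words can be joined back with `" ".join(...)` inside the same `apply`.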