【Posted】: 2020-09-03 07:48:59
【Problem description】:
I have a dataframe containing text and a category. I want to count the words that are common across these categories. I am using nltk to remove stop words and to tokenize, but I cannot carry the category through that process. Below is sample code for my problem.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, Row
import nltk

spark_conf = SparkConf().setAppName("test")
sc = SparkContext.getOrCreate(spark_conf)
# the original code used an undefined sqlContext; create a SparkSession instead
spark = SparkSession(sc)

def wordTokenize(x):
    words = [word for line in x for word in line.split()]
    return words

def rmstop(x):
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    word = [w for w in x if w not in stop_words]
    return word

# in the actual problem I read a file into a dataframe,
# so create a dataframe first
df = [('Happy', 'I am so happy today'),
      ('Happy', 'its my birthday'),
      ('Happy', 'lets have fun'),
      ('Sad', 'I am going to die today'),
      ('Neutral', 'I am going to office today'),
      ('Neutral', 'This is my house')]
rdd = sc.parallelize(df)
rdd_data = rdd.map(lambda x: Row(Category=x[0], text=x[1]))
df_data = spark.createDataFrame(rdd_data)

# convert to rdd for the nltk processing
df_data_rdd = df_data.select('text').rdd.flatMap(lambda x: x)

# lowercase and sentence-tokenize (requires nltk.download('punkt'))
df_data_rdd1 = df_data_rdd.map(lambda x: x.lower())\
    .map(lambda x: nltk.sent_tokenize(x))

# word tokenize
data_rdd1_words = df_data_rdd1.map(wordTokenize)

# remove stop words and deduplicate
data_rdd1_words_clean = data_rdd1_words.map(rmstop)\
    .flatMap(lambda x: x)\
    .distinct()
data_rdd1_words_clean.collect()
Output: ['today', 'birthday', 'lets', 'die', 'house', 'happy', 'fun', 'going', 'office']
I want to count word frequencies (after preprocessing) with respect to the categories. For example: "today": 3, because it occurs in all three categories.
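The key idea is to keep the category attached to each word while tokenizing, then count distinct (category, word) pairs. A minimal sketch of that logic in plain Python (outside Spark, so it runs anywhere); the stop-word set here is a small hand-rolled stand-in for nltk's English list, purely illustrative:

```python
from collections import defaultdict

# Toy data mirroring the dataframe above
rows = [('Happy', 'I am so happy today'),
        ('Happy', 'its my birthday'),
        ('Happy', 'lets have fun'),
        ('Sad', 'I am going to die today'),
        ('Neutral', 'I am going to office today'),
        ('Neutral', 'This is my house')]

# Illustrative stop-word list standing in for nltk's stopwords.words('english')
stop_words = {'i', 'am', 'so', 'its', 'my', 'have', 'to', 'is', 'this'}

# Distinct (category, word) pairs: a word counts at most once per category
pairs = {(cat, w) for cat, text in rows
         for w in text.lower().split() if w not in stop_words}

# For each word, count how many distinct categories it occurs in
category_count = defaultdict(int)
for cat, w in pairs:
    category_count[w] += 1

print(category_count['today'])  # -> 3, present in Happy, Sad and Neutral
```

In Spark the same shape would be an RDD of (Category, word) pairs run through `distinct()` and then counted per word, rather than dropping the category column up front as the code above does.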
【Comments】:
-
Can you post the current dataframe and the expected output you need?
-
Hi, I have included a sample dataframe. I need to count how many words occur in all three categories, how many in two, and how many in one. My actual dataset has many more categories.
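Given the clarification above (how many words occur in exactly one, two, or three categories), the per-word category counts can be aggregated into a histogram. A self-contained sketch, again with an illustrative hand-rolled stop-word set rather than nltk's:

```python
from collections import Counter

rows = [('Happy', 'I am so happy today'),
        ('Happy', 'its my birthday'),
        ('Happy', 'lets have fun'),
        ('Sad', 'I am going to die today'),
        ('Neutral', 'I am going to office today'),
        ('Neutral', 'This is my house')]

# Illustrative stand-in for nltk's English stop-word list
stop_words = {'i', 'am', 'so', 'its', 'my', 'have', 'to', 'is', 'this'}

# Distinct (category, word) pairs
pairs = {(cat, w) for cat, text in rows
         for w in text.lower().split() if w not in stop_words}

# word -> number of categories it appears in
per_word = Counter(w for _, w in pairs)

# k -> number of words that appear in exactly k categories
histogram = Counter(per_word.values())
print(histogram)  # e.g. 1 word in all 3 categories ('today')
```

The equivalent Spark pipeline would be two `reduceByKey` passes: one keyed on word to get the category count, then one keyed on that count.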
Tags: apache-spark pyspark nltk rdd