[Posted]: 2021-11-11 05:05:58
[Question]:
I'm new to pyspark and could use some guidance. I'm working with some text data, and eventually I want to get rid of words that appear either too rarely or too frequently across the whole corpus.
The data looks like this, with each row containing one sentence:
+--------------------+
| cleaned|
+--------------------+
|China halfway com...|
|MCI overhaul netw...|
|script kiddy join...|
|look Microsoft Mo...|
|Americans appear ...|
|Oil Eases Venezue...|
|Americans lose be...|
|explosion Echo Na...|
|Bush tackle refor...|
|jail olympic pool...|
|coyote sign RW Jo...|
|home pc key Windo...|
|bomb defuse Blair...|
|Livermore need ...|
|hat ring fast Wi ...|
|Americans dutch s...|
|Insect Vibrations...|
|Britain sleepwalk...|
|Ron Regan Jr Kind...|
|IBM buy danish fi...|
+--------------------+
So basically I use split() from pyspark.sql.functions to split the strings, then count the occurrences of each word, come up with some criteria, and build a list of words that need to be removed.
Then I use the following functions:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import *

def remove_stop_words(list_of_tokens, list_of_stopwords):
    '''
    A very simple function that takes in a list of word tokens and then gets rid of words that are in the stopwords list
    '''
    return [token for token in list_of_tokens if token not in list_of_stopwords]

def udf_remove_stop_words(list_of_stopwords):
    '''
    creates a udf that takes in a list of stop words and passes them on to remove_stop_words
    '''
    return udf(lambda x: remove_stop_words(x, list_of_stopwords))

wordsNoStopDF = splitworddf.withColumn('removed', udf_remove_stop_words(list_of_words_to_get_rid)(col('split')))
Here list_of_words_to_get_rid is the list of words I want to remove. The input to this pipeline looks like this:
+--------------------+
| split|
+--------------------+
|[China, halfway, ...|
|[MCI, overhaul, n...|
|[script, kiddy, j...|
|[look, Microsoft,...|
|[Americans, appea...|
|[Oil, Eases, Vene...|
|[Americans, lose,...|
|[explosion, Echo,...|
|[Bush, tackle, re...|
|[jail, olympic, p...|
+--------------------+
only showing top 10 rows
The output looks like this, with the corresponding schema:
+--------------------+--------------------+
| split| removed|
+--------------------+--------------------+
|[China, halfway, ...|[China, halfway, ...|
|[MCI, overhaul, n...|[MCI, overhaul, n...|
|[script, kiddy, j...|[script, join, fo...|
|[look, Microsoft,...|[look, Microsoft,...|
|[Americans, appea...|[Americans, appea...|
|[Oil, Eases, Vene...|[Oil, Eases, Vene...|
|[Americans, lose,...|[Americans, lose,...|
|[explosion, Echo,...|[explosion, Echo,...|
|[Bush, tackle, re...|[Bush, tackle, re...|
|[jail, olympic, p...|[jail, olympic, p...|
|[coyote, sign, RW...|[coyote, sign, Jo...|
|[home, pc, key, W...|[home, pc, key, W...|
|[bomb, defuse, Bl...|[bomb, defuse, Bl...|
|[Livermore, , , n...|[Livermore, , , n...|
|[hat, ring, fast,...|[hat, ring, fast,...|
|[Americans, dutch...|[Americans, dutch...|
|[Insect, Vibratio...|[tell, Good, Time...|
|[Britain, sleepwa...|[Britain, big, br...|
|[Ron, Regan, Jr, ...|[Ron, Jr, Guy, , ...|
|[IBM, buy, danish...|[IBM, buy, danish...|
+--------------------+--------------------+
root
|-- split: array (nullable = true)
| |-- element: string (containsNull = true)
|-- removed: string (nullable = true)
So my question is: how do I turn the removed column into an array like split? I was hoping to use explode to count word occurrences, but I can't quite figure out how to get there. I tried using regexp_replace to strip the brackets and then splitting the string on `, `, but that only seemed to add a bracket to the removed column.
Could I make some changes to the functions I'm using so that they return an array of strings, like the split column?
Any guidance here would be greatly appreciated!
[Comments]:
Tags: python apache-spark pyspark apache-spark-sql