【Posted】: 2016-09-06 15:25:17
【Question】:
I'm doing some work with the PySpark MLlib FPGrowth algorithm, and I have an RDD in which individual rows (transactions) contain duplicate items. These duplicates cause the model training function to throw an error. I'm fairly new to Spark and would like to know how to remove the duplicate items within each row of the RDD. As an example:
#simple example: note the duplicate "a" in the first and third transactions
from pyspark.mllib.fpm import FPGrowth
data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)
model = FPGrowth.train(rdd, 0.6, 2)  # raises an error because of the duplicate items
freqit = model.freqItemsets()
freqit.collect()
I want to remove the duplicates so that it looks like this:
#simple example, deduplicated
from pyspark.mllib.fpm import FPGrowth
data_dedup = [["a", "b", "c"], ["a", "b", "d", "e"], ["a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data_dedup)
model = FPGrowth.train(rdd, 0.6, 2)  # trains successfully on unique-item transactions
freqit = model.freqItemsets()
freqit.collect()
which runs without errors.
Thanks in advance!
【Comments】:
- You could write a map function that cleans out the duplicates. The function takes one entry as input and outputs that entry without duplicates; call it f. Then run rdd.map(f), and the result should be a "clean" RDD. A minimal sketch follows these comments.
- Thanks @LiMuBei, do you have any suggestions for what that function might look like?
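A minimal sketch of such a map function, assuming dropping duplicate items within each transaction is all that's needed (the dedup name and the use of set() are illustrative, not from the thread; FP-growth treats each transaction as a set, so item order doesn't matter):
from pyspark.mllib.fpm import FPGrowth

def dedup(transaction):
    # Drop duplicate items within a single transaction.
    return list(set(transaction))

data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data).map(dedup)  # clean every row before training
model = FPGrowth.train(rdd, 0.6, 2)    # no duplicate-item error now
model.freqItemsets().collect()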
Tags: apache-spark machine-learning pyspark data-science