【Question title】: Find unique tuples in Pyspark RDD
【Posted】: 2017-03-11 08:02:13
【Question】:

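I have an RDD in PySpark holding user activity data from a shopping platform:

user_id | product_id | event (product viewed, purchased, added to cart, etc.)

The problem is that the same (user_id, product_id) tuple can have several event types. I want to collect all such events in a single row.

Example: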

╔═════════════════════════════════════════════════╗
║ user_id    |  product_id             |   Event  ║
╠═════════════════════════════════════════════════╣
║ 1               1                     viewed    ║
║ 1               1                     purchased ║
║ 2               1                     added     ║
║ 2               2                     viewed    ║
║ 2               2                     added     ║
╚═════════════════════════════════════════════════╝

I want:

╔════════════════════════════════════════════════╗
║ user_id | product_id |      Event              ║
╠════════════════════════════════════════════════╣
║ 1          1          {viewed, purchased}      ║
║ 2          1          {added}                  ║
║ 2          2          {viewed, added}          ║
╚════════════════════════════════════════════════╝

【Comments】:

  • Have you considered using the built-in map and groupByKey functions?

Tags: python apache-spark mapreduce pyspark
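
The map plus groupByKey suggestion can be sketched in PySpark roughly as follows. This is only an illustration: the helper names `to_pair` and `collect_events` are hypothetical, and `rdd` is assumed to be an RDD of three-element records in the order (user_id, product_id, event).

```python
def to_pair(record):
    """Turn a (user_id, product_id, event) record into ((user_id, product_id), event)."""
    return ((record[0], record[1]), record[2])


def collect_events(rdd):
    """Return an RDD of ((user_id, product_id), set_of_events).

    Assumes `rdd` is a PySpark RDD of 3-element records.
    """
    return (
        rdd.map(to_pair)     # key each record by its (user_id, product_id) pair
           .groupByKey()     # gather every event that shares a key
           .mapValues(set)   # keep each event type once per key
    )
```

With the sample data from the question, `collect_events(rdd).collect()` should yield one entry per (user_id, product_id) pair; the order of the entries is not guaranteed.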


【Solution 1】:

In Scala it should look something like this:

// user_id, product_id and Event stand in for the actual field types of the RDD.
val grouped: RDD[((user_id, product_id), Iterable[Event])] = rdd.map(triplet => ((triplet._1, triplet._2), triplet._3)).groupByKey()

【Comments】:

    【Solution 2】:

    If you want to try the DataFrame API, have a look at this:

    import pyspark.sql.functions as F
    rdd = sc.parallelize([[1, 1, 'viewed'],[1, 1, 'purchased'],[2, 1, 'added'],[2, 2, 'viewed'],[2, 2, 'added']])
    df = rdd.toDF(['user_id', 'product_id', 'Event'])
    df.groupby(['user_id', 'product_id']).agg(F.collect_set("Event")).show()
    

    If you would rather stay with RDDs, have a look at this:

    rdd = sc.parallelize([[1, 1, 'viewed'],[1, 1, 'purchased'],[2, 1, 'added'],[2, 2, 'viewed'],[2, 2, 'added']])
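    # Group by (user_id, product_id), then keep only the event field of each grouped record.
    # A list comprehension is used instead of map() so the result is a plain list on Python 3.
    rdd.groupBy(lambda x: (x[0], x[1])).map(lambda kv: (kv[0][0], kv[0][1], [rec[2] for rec in kv[1]])).collect()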
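
The groupBy approach gathers every record for a key before reducing it to a list of events. If only the distinct events are needed, one possible alternative is `aggregateByKey`, which builds a set per key as records arrive. The sketch below is an assumption about how that could look, not part of the original answer; the names `add_event`, `merge_events` and `unique_events` are hypothetical, and `rdd` is again assumed to hold (user_id, product_id, event) records.

```python
def add_event(events, event):
    """Fold one event into the running set for a key (within a partition)."""
    events.add(event)
    return events


def merge_events(left, right):
    """Combine the partial sets produced by two partitions for the same key."""
    left.update(right)
    return left


def unique_events(rdd):
    """Return an RDD of (user_id, product_id, set_of_events).

    Assumes `rdd` is a PySpark RDD of 3-element records.
    """
    return (
        rdd.map(lambda r: ((r[0], r[1]), r[2]))               # key by (user_id, product_id)
           .aggregateByKey(set(), add_event, merge_events)    # one set of events per key
           .map(lambda kv: (kv[0][0], kv[0][1], kv[1]))       # flatten back to three fields
    )
```

Calling `unique_events(rdd).collect()` on the sample data should return one row per (user_id, product_id) pair with its distinct events, in no guaranteed order.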
    

    【Comments】:
