如何在 Pyspark 中获取映射值？答案

【问题标题】：How to get mapped values in Pyspark?如何在 Pyspark 中获取映射值？
【发布时间】：2020-10-06 18:44:56
【问题描述】：

我有一个用于进行所有计算的数据框，它有一个 id 列和一个 name 列

id | name
1  | Alex
2  | Bob
3  | Chris
4  | Kevin

我做了一堆操作，得到了他们最亲密的朋友，这是[id, score]形式的对列表

id | friends
1  | [[2, 49], [3, 15]]
2  | [[4, 61], [2, 49], [3, 4]]

如何将此朋友列表映射到姓名列表？现在可以降分了。理想情况下它看起来像

id | friends
1  | [Bob, Chris]
2  | [Kevin, Bob, Chris]

我正在考虑某种形式的加入，但我很困惑这将如何工作，因为它是一个列表

【问题讨论】：

你的 pyspark 版本是多少？
@anky 2.4.6 版

标签： python dataframe apache-spark pyspark

【解决方案1】：

也许这很有用（用scala写的）

如果维度数据框代表大小数据集，则添加两个选项

1。加载两个数据帧

 val data1 =
      """
        |id | name
        |1  | Alex
        |2  | Bob
        |3  | Chris
        |4  | Kevin
      """.stripMargin

    val stringDS1 = data1.split(System.lineSeparator())
      .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
      .toSeq.toDS()
    val df1 = spark.read
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("nullValue", "null")
      .csv(stringDS1)
    df1.show(false)
    df1.printSchema()
    /**
      * +---+-----+
      * |id |name |
      * +---+-----+
      * |1  |Alex |
      * |2  |Bob  |
      * |3  |Chris|
      * |4  |Kevin|
      * +---+-----+
      *
      * root
      * |-- id: integer (nullable = true)
      * |-- name: string (nullable = true)
      */

    val df2 =
      spark.sql(
        """
          |select id, friends from values
          | (1, array(named_struct('id', 2, 'score', 49), named_struct('id', 3, 'score', 15))),
          | (2, array(named_struct('id', 4, 'score', 61), named_struct('id', 2, 'score', 49), named_struct('id', 3,
          | 'score', 4)))
          | T(id, friends)
        """.stripMargin)
    df2.show(false)
    df2.printSchema()
    /**
      * +---+--------------------------+
      * |id |friends                   |
      * +---+--------------------------+
      * |1  |[[2, 49], [3, 15]]        |
      * |2  |[[4, 61], [2, 49], [3, 4]]|
      * +---+--------------------------+
      *
      * root
      * |-- id: integer (nullable = false)
      * |-- friends: array (nullable = false)
      * |    |-- element: struct (containsNull = false)
      * |    |    |-- id: integer (nullable = false)
      * |    |    |-- score: integer (nullable = false)
      */

2。如果维度数据很大


    // if df1 has big data
    val exploded = df2.select($"id", explode(expr("friends.id")).as("friend_id"))
      exploded.join(df1, exploded("friend_id")===df1("id"))
      .groupBy(exploded("id"))
      .agg(collect_list($"name").as("friends"))
      .show(false)
    /**
      * +---+-------------------+
      * |id |friends            |
      * +---+-------------------+
      * |2  |[Bob, Chris, Kevin]|
      * |1  |[Bob, Chris]       |
      * +---+-------------------+
      */

3。如果维度数据框很小

    // if df1 is small
    val b = spark.sparkContext.broadcast(df1.collect().map{case Row(id: Int, name: String) => id -> name}.toMap)

    val getFriendsName = udf((idArray: mutable.WrappedArray[Int]) => idArray.map(b.value(_)))

    df2.withColumn("friends", getFriendsName(expr("friends.id")))
      .show(false)

    /**
      * +---+-------------------+
      * |id |friends            |
      * +---+-------------------+
      * |1  |[Bob, Chris]       |
      * |2  |[Kevin, Bob, Chris]|
      * +---+-------------------+
      */

【讨论】：

对于更大的数据集，您也将退回到爆炸和收集。
既然我们没有收集indefinite length 列表，我觉得没关系，不是吗？
朋友可能有大约 10 个条目，所以这是完美的。谢谢

【解决方案2】：

看看这个

df = spark.createDataFrame([('1' , 'Alex'), ('2' , 'Bob'), ('3' , 'Chris'), ('4' , 'Kevin')],['id' , 'name'])
df2 = spark.createDataFrame([('1',[[2, 49], [3, 15]]), ('2',[[4, 61], [2, 49], [3, 4]])], ['id' , 'friends'])

df3 = df2.select('id','friends' ,f.expr('''explode(transform(friends,x->x[0])) as friend'''))

df3.join(df,df.id.cast('int')==df3.friend.cast('int')).groupBy(df3.id,df3.friends).agg(f.collect_list('name').alias('friend')).show()

+---+--------------------+-------------------+
| id|             friends|             friend|
+---+--------------------+-------------------+
|  1|  [[2, 49], [3, 15]]|       [Chris, Bob]|
|  2|[[4, 61], [2, 49]...|[Chris, Kevin, Bob]|
+---+--------------------+-------------------+

【讨论】：

寻找更好的方法@murtihash 我们可以做到不爆炸然后收集为列表
只是一个想法，如果数据集df很小，创建一个地图并广播它在transform内部使用。
@SomeshwarKale 如果数据集足够大怎么办，只是好奇是否可以以更好的方式完成