【Question title】: How to join multiple columns from one DataFrame with another DataFrame
【Posted】: 2018-07-07 21:43:11
【Question】:

I have two DataFrames, recommendations and movies. The columns rec1-rec3 in recommendations hold movie IDs from the movies DataFrame.

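import org.apache.spark.sql.DataFrame
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`

val recommendations: DataFrame = List(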
        (0, 1, 2, 3),
        (1, 2, 3, 4),
        (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

val movies = List(
        (1, "the Lord of the Rings"),
        (2, "Star Wars"),
        (3, "Star Trek"),
        (4, "Pulp Fiction")).toDF("id", "name")

What I want:

+---+------------------------+------------+------------+
| id|                    rec1|        rec2|        rec3|
+---+------------------------+------------+------------+
|  0|   the Lord of the Rings|   Star Wars|   Star Trek|
|  1|               Star Wars|   Star Trek|Pulp Fiction|
|  2|   the Lord of the Rings|   Star Trek|Pulp Fiction|
+---+------------------------+------------+------------+

【Question comments】:

    Tags: scala apache-spark apache-spark-sql


    【Solution 1】:

    I figured it out. You should create aliases for the columns, just as in SQL.

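      // Requires org.apache.spark.sql.functions.col and spark.implicits._ (for the 'symbol column syntax)
      // Join movies once per rec column, aliasing id/name each time so the column names stay unique
      val joined = recommendations
        .join(movies.select(col("id").as("id1"), 'name.as("n1")), 'id1 === recommendations.col("rec1"))
        .join(movies.select(col("id").as("id2"), 'name.as("n2")), 'id2 === recommendations.col("rec2"))
        .join(movies.select(col("id").as("id3"), 'name.as("n3")), 'id3 === recommendations.col("rec3"))
        .select('id, 'n1, 'n2, 'n3)
      joined.show()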
    

    The query results in:

    +---+--------------------+---------+------------+
    | id|                  n1|       n2|          n3|
    +---+--------------------+---------+------------+
    |  0|the Lord of the R...|Star Wars|   Star Trek|
    |  1|           Star Wars|Star Trek|Pulp Fiction|
    |  2|the Lord of the R...|Star Trek|Pulp Fiction|
    +---+--------------------+---------+------------+
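The three joins above repeat the same pattern, so with more rec columns it may be easier to fold over the column names. The following is a minimal sketch, not the answerer's code: `JoinRecColumns`, `withMovieNames`, and the `<col>_id` / `<col>_name` aliases are hypothetical names chosen for illustration. It also assumes a left join is acceptable, which differs from the inner join above: a recommendation row is kept with a null name when an ID has no match in movies.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object JoinRecColumns {
  // Replace each ID column listed in `recCols` with the matching movie name.
  def withMovieNames(recs: DataFrame, movies: DataFrame, recCols: Seq[String]): DataFrame = {
    val joined = recCols.foldLeft(recs) { (df, c) =>
      // Alias the lookup columns on every iteration so names never collide.
      val lookup = movies.select(col("id").as(s"${c}_id"), col("name").as(s"${c}_name"))
      // Left join (an assumption): keep the row even if the ID is unknown.
      df.join(lookup, df(c) === lookup(s"${c}_id"), "left")
    }
    // Keep the recommendation id and expose each name under the original column name.
    joined.select(col("id") +: recCols.map(c => col(s"${c}_name").as(c)): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-rec-columns").getOrCreate()
    import spark.implicits._

    val recommendations = List((0, 1, 2, 3), (1, 2, 3, 4), (2, 1, 3, 4))
      .toDF("id", "rec1", "rec2", "rec3")
    val movies = List((1, "the Lord of the Rings"), (2, "Star Wars"), (3, "Star Trek"), (4, "Pulp Fiction"))
      .toDF("id", "name")

    // show(false) disables truncation, so long titles print in full.
    withMovieNames(recommendations, movies, Seq("rec1", "rec2", "rec3")).orderBy("id").show(false)
    spark.stop()
  }
}
```

Each fold step still adds one join, so this only tidies the code. It does not reduce the number of joins.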
    

    【Comments】:

      【Solution 2】:

      We can also use the functions stack() and pivot() to get your expected output, joining the two DataFrames only once.

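      // Requires org.apache.spark.sql.functions.{col, expr, first} and spark.implicits._ (for $"id")
      // First rename the 'id' column to 'ids' to avoid duplicate column names further downstream
      val moviesRenamed = movies.withColumnRenamed("id", "ids")
      
      // stack() unpivots rec1..rec3 into (rec, movie_id) rows; pivot() turns them back into columns after the join
      recommendations.select($"id", expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))
        .where("rec is not null")
        .join(moviesRenamed, col("movie_id") === moviesRenamed.col("ids"))
        .groupBy("id")
        .pivot("rec")
        .agg(first("name"))
        .show()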
      +---+--------------------+---------+------------+
      | id|                rec1|     rec2|        rec3|
      +---+--------------------+---------+------------+
      |  0|the Lord of the R...|Star Wars|   Star Trek|
      |  1|           Star Wars|Star Trek|Pulp Fiction|
      |  2|the Lord of the R...|Star Trek|Pulp Fiction|
      +---+--------------------+---------+------------+
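To make the stack()/pivot() approach easier to follow, the sketch below splits it into two steps so the intermediate long-format DataFrame can be inspected. It is an illustration under assumptions rather than the answerer's code: `StackPivotSketch`, `toLong`, and `withNames` are hypothetical names. It also passes the pivot values explicitly to `pivot`, a variation on the answer that fixes the output columns up front instead of having Spark compute the distinct values of `rec`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr, first}

object StackPivotSketch {
  // Unpivot rec1..rec3 into one (id, rec, movie_id) row per recommendation slot.
  def toLong(recs: DataFrame): DataFrame =
    recs.select(col("id"),
      expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))

  // Join the long format against movies once, then pivot back to one row per id.
  def withNames(recs: DataFrame, movies: DataFrame): DataFrame =
    toLong(recs)
      .join(movies.withColumnRenamed("id", "ids"), col("movie_id") === col("ids"))
      .groupBy("id")
      .pivot("rec", Seq("rec1", "rec2", "rec3")) // explicit values: an assumption, see lead-in
      .agg(first("name"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("stack-pivot-sketch").getOrCreate()
    import spark.implicits._

    val recommendations = List((0, 1, 2, 3), (1, 2, 3, 4), (2, 1, 3, 4))
      .toDF("id", "rec1", "rec2", "rec3")
    val movies = List((1, "the Lord of the Rings"), (2, "Star Wars"), (3, "Star Trek"), (4, "Pulp Fiction"))
      .toDF("id", "name")

    toLong(recommendations).show()                                  // intermediate: 3 rows per input row
    withNames(recommendations, movies).orderBy("id").show(false)    // final wide result, untruncated
    spark.stop()
  }
}
```

Because the join is an inner join, a slot whose movie ID is missing from movies ends up as null in the pivoted result, and an id with no matching movies at all disappears from the output.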
      

      【Comments】:
