【Question title】: How would I merge these two dataframes to produce the third dataframe in Spark Scala?
【Posted】: 2020-12-03 14:10:43
【Question description】:

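I'm having trouble joining these two dataframe views because I can't modify specific column values in Spark Scala. I think I need some kind of transpose followed by a join, but I can't figure out how to do it.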

Here is the first dataframe:

  var sample_df = Seq(("john","morning","7am"),("john","night","10pm"),("bob","morning","8am"),("bob","night","11pm"),("phil","morning","9am"),("phil","night","10pm")).toDF("person","time_of_day","wake/sleep hour")

Here is the second dataframe:

  var sample_df2 = Seq(("john","6am","11pm"),("bob","7am","2am"),("phil","8am","1am")).toDF("person","morning_earliest","night_latest")

Here is the resulting dataframe I want to produce:

  var resulting_df = Seq(("john","morning","7am","6am"),("john","night","10pm","11pm"),("bob","morning","8am","7am"),("bob","night","11pm","2am"),("phil","morning","9am","8am"),("phil","night","10pm","1am")).toDF("person","time_of_day","wake/sleep hour","earliest/latest")

Any help would be greatly appreciated! Thanks, and have a nice day!

【Comments】:

    Tags: scala apache-spark apache-spark-sql


    【Solution 1】:
    sample_df.createOrReplaceTempView("df1")
    sample_df2.createOrReplaceTempView("df2")
    
    spark.sql("""
    select person, time_of_day, `wake/sleep hour`, `earliest/latest`
    from (
        select person, stack(2, 'morning', morning_earliest, 'night', night_latest) as (time_of_day, `earliest/latest`)
        from df2
    ) df
    join df1
    using (time_of_day, person)
    """).show()
    
    +------+-----------+---------------+---------------+
    |person|time_of_day|wake/sleep hour|earliest/latest|
    +------+-----------+---------------+---------------+
    |  john|    morning|            7am|            6am|
    |  john|      night|           10pm|           11pm|
    |   bob|    morning|            8am|            7am|
    |   bob|      night|           11pm|            2am|
    |  phil|    morning|            9am|            8am|
    |  phil|      night|           10pm|            1am|
    +------+-----------+---------------+---------------+
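
The SQL above first unpivots `df2` with `stack`, turning each person's `morning_earliest` and `night_latest` columns into two rows keyed by `time_of_day`, and then joins on both `person` and `time_of_day`. For readers who prefer the DataFrame API, the same idea might be sketched roughly as follows. This is an illustrative sketch and not part of the original answer: the object name `UnpivotJoinSketch`, the helper `unpivotAndJoin`, and the local `SparkSession` settings are assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnpivotJoinSketch {
  // Hypothetical helper (not from the original answer): unpivot df2's two
  // hour columns into rows, then join to df1 on both key columns.
  def unpivotAndJoin(df1: DataFrame, df2: DataFrame): DataFrame = {
    val unpivoted = df2.selectExpr(
      "person",
      // stack(2, ...) emits two rows per input row: one per (label, value) pair
      "stack(2, 'morning', morning_earliest, 'night', night_latest) as (time_of_day, `earliest/latest`)"
    )
    // Joining with df1 on the left keeps its columns ahead of the new one
    df1.join(unpivoted, Seq("person", "time_of_day"))
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session for a standalone run; spark-shell already provides `spark`
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unpivot-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val sample_df = Seq(
      ("john", "morning", "7am"), ("john", "night", "10pm"),
      ("bob", "morning", "8am"), ("bob", "night", "11pm"),
      ("phil", "morning", "9am"), ("phil", "night", "10pm")
    ).toDF("person", "time_of_day", "wake/sleep hour")

    val sample_df2 = Seq(
      ("john", "6am", "11pm"), ("bob", "7am", "2am"), ("phil", "8am", "1am")
    ).toDF("person", "morning_earliest", "night_latest")

    unpivotAndJoin(sample_df, sample_df2).show()
    spark.stop()
  }
}
```

Row order after a join is not guaranteed, so add an explicit `orderBy` if the output must match a particular ordering.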
    

    【Comments】:

      【Solution 2】:
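      import org.apache.spark.sql.functions.{col, when}
      import spark.implicits._  // enables the $"..." column syntax

      // Attach each person's earliest/latest bounds to every row of sample_df
      val df = sample_df
          .join(sample_df2, "person")

      // Pick the bound matching the row's time_of_day (any non-"morning" value
      // falls through to night_latest), then drop the two helper columns
      val resulting_df = df.withColumn("earliest/latest",
          when(col("time_of_day") === "morning", $"morning_earliest")
          .otherwise($"night_latest"))
          .drop("morning_earliest", "night_latest")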
      
      resulting_df.show()
      

      【Comments】:
