【Title】: Pyspark - How can I restructure my output in the following format?
【Posted】: 2020-06-16 10:39:41
【Description】:

I have two tables, shown below:

Table 1 -

Table 2 -

I want my output to look like the table below, with the following calculation performed in a loop until every week in table 2 has been multiplied by every day's ratio in table 1:

table3(week1_1) = table2(week1) * table1(day1_ratio)

table3(week1_2) = table2(week1) * table1(day2_ratio)
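The calculation described above amounts to multiplying each weekly total by each day's ratio. As a plain-Python sanity check (not Spark; the sample values are hypothetical placeholders standing in for the missing table screenshots), it can be sketched as:

```python
# Hypothetical sample data standing in for the two tables in the question.
day_ratios = {1: 0.6, 2: 0.4}        # table 1: day -> ratio
weeks = {"week1": 4, "week2": 5}     # table 2: week -> weekly total

# table3(weekN_M) = table2(weekN) * table1(dayM_ratio)
table3 = {
    f"{week}_{day}": total * ratio
    for week, total in weeks.items()
    for day, ratio in day_ratios.items()
}
# e.g. table3["week1_1"] is 4 * 0.6 = 2.4
```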

How can this be done?

All help is appreciated!

Thanks

【Discussion】:

    Tags: for-loop pyspark output


    【Solution 1】:

    Try this -

    Written in Scala, but it ports to pyspark with minimal changes.

    Load the input

       val table1 = Seq(
          ("o1", "i1", 1, 0.6),
          ("o1", "i1", 2, 0.4)
        ).toDF("outlet", "item", "day", "ratio")
        table1.show(false)
        /**
          * +------+----+---+-----+
          * |outlet|item|day|ratio|
          * +------+----+---+-----+
          * |o1    |i1  |1  |0.6  |
          * |o1    |i1  |2  |0.4  |
          * +------+----+---+-----+
          */
    
        val table2 = Seq(
          ("o1", "i1", 4, 5, 6, 8)
        ).toDF("outlet", "item", "week1", "week2", "week3", "week4")
        table2.show(false)
        /**
          * +------+----+-----+-----+-----+-----+
          * |outlet|item|week1|week2|week3|week4|
          * +------+----+-----+-----+-----+-----+
          * |o1    |i1  |4    |5    |6    |8    |
          * +------+----+-----+-----+-----+-----+
          */
    

    Apply Spark functions

        table1.join(table2, Seq("outlet", "item"))
          .groupBy("outlet", "item")
          .pivot("day")
          .agg(
            first($"week1" * $"ratio").as("week1"),
            first($"week2" * $"ratio").as("week2"),
            first($"week3" * $"ratio").as("week3"),
            first($"week4" * $"ratio").as("week4")
          ).show(false)
    
        /**
          * +------+----+-------+-------+------------------+-------+-------+-------+------------------+-------+
          * |outlet|item|1_week1|1_week2|1_week3           |1_week4|2_week1|2_week2|2_week3           |2_week4|
          * +------+----+-------+-------+------------------+-------+-------+-------+------------------+-------+
          * |o1    |i1  |2.4    |3.0    |3.5999999999999996|4.8    |1.6    |2.0    |2.4000000000000004|3.2    |
          * +------+----+-------+-------+------------------+-------+-------+-------+------------------+-------+
          */
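
The `pivot("day")` call turns each distinct `day` value into a column prefix, which is why the output columns are named `1_week1`, `2_week1`, and so on. A rough plain-Python illustration of that reshaping (not Spark code; values copied from the sample tables above):

```python
# Sketch of what groupBy(...).pivot("day").agg(...) produces for the sample rows.
table1 = [("o1", "i1", 1, 0.6), ("o1", "i1", 2, 0.4)]   # outlet, item, day, ratio
table2 = {("o1", "i1"): {"week1": 4, "week2": 5, "week3": 6, "week4": 8}}

pivoted = {}
for outlet, item, day, ratio in table1:
    for week, total in table2[(outlet, item)].items():
        # pivot names each output column "<day value>_<aggregation alias>"
        pivoted[f"{day}_{week}"] = total * ratio
# pivoted["1_week1"] is 2.4 and pivoted["2_week4"] is 3.2, matching the table above
```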
    

    In Python

    from pyspark.sql import functions as F

    # use F.col(...) here: the joined frame is anonymous, so there is no `df` to reference
    table1.join(table2, ['outlet', 'item']) \
        .groupBy("outlet", "item") \
        .pivot("day") \
        .agg(
            F.first(F.col("week1") * F.col("ratio")).alias("week1"),
            F.first(F.col("week2") * F.col("ratio")).alias("week2"),
            F.first(F.col("week3") * F.col("ratio")).alias("week3"),
            F.first(F.col("week4") * F.col("ratio")).alias("week4")
        ).show(truncate=False)
    

    【Discussion】:

    • Is there any way to get this in pyspark? I tried, but had no success. It would really help if possible!
    • Try this - from pyspark.sql import functions as F table1.join(table2, ['outlet', 'item']) .groupBy("outlet", "item") .pivot("day") .agg( F.first(df.week1 * df.ratio).alias("week1"), F.first(df.week2 * df.ratio).alias("week2"), F.first(df.week3 * df.ratio).alias("week3"), F.first(df.week4 * df.ratio).alias("week4") ).show(truncate=False). Haven't executed it on pyspark yet, but it should work