【Title】: Pyspark - How can I restructure my output in the following format?
【Posted】: 2020-06-16 10:39:41
【Description】:

I have two tables, shown below:

Table 1 -

Table 2 -

I want my output to look like the table below, with the following calculation performed in a loop until every week in table 2 has been multiplied by every day's ratio in table 1:

table3(week1_1) = table2(week1) * table1(day1_ratio)

table3(week1_2) = table2(week1) * table1(day2_ratio)
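The calculation described above amounts to multiplying each weekly total by each day's ratio. As a plain-Python sanity check (not Spark; the sample values are hypothetical placeholders standing in for the missing table screenshots), it can be sketched as:

```python
# Hypothetical sample data standing in for the two tables in the question.
day_ratios = {1: 0.6, 2: 0.4}        # table 1: day -> ratio
weeks = {"week1": 4, "week2": 5}     # table 2: week -> weekly total

# table3(weekN_M) = table2(weekN) * table1(dayM_ratio)
table3 = {
    f"{week}_{day}": total * ratio
    for week, total in weeks.items()
    for day, ratio in day_ratios.items()
}
# e.g. table3["week1_1"] is 4 * 0.6 = 2.4
```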

How can this be done?

All help is appreciated!

Thanks

【Discussion】:

    Tags: for-loop pyspark output


    【Solution 1】:

    Try this -

    Written in Scala, but it ports to pyspark with minimal changes.

    Load the input

       val table1 = Seq(
          ("o1", "i1", 1, 0.6),
          ("o1", "i1", 2, 0.4)
        ).toDF("outlet", "item", "day", "ratio")
        table1.show(false)
        /**
          * +------+----+---+-----+
          * |outlet|item|day|ratio|
          * +------+----+---+-----+
          * |o1    |i1  |1  |0.6  |
          * |o1    |i1  |2  |0.4  |
          * +------+----+---+-----+
          */
    
        val table2 = Seq(
          ("o1", "i1", 4, 5, 6, 8)
        ).toDF("outlet", "item", "week1", "week2", "week3", "week4")
        table2.show(false)
        /**
          * +------+----+-----+-----+-----+-----+
          * |outlet|item|week1|week2|week3|week4|
          * +------+----+-----+-----+-----+-----+
          * |o1    |i1  |4    |5    |6    |8    |
          * +------+----+-----+-----+-----+-----+
          */
    

    Apply Spark functions

        table1.join(table2, Seq("outlet", "item"))
          .groupBy("outlet", "item")
          .pivot("day")
          .agg(
            first($"week1" * $"ratio").as("week1"),
            first($"week2" * $"ratio").as("week2"),
            first($"week3" * $"ratio").as("week3"),
            first($"week4" * $"ratio").as("week4")
          ).show(false)
    
        /**
          * +------+----+-------+-------+------------------+-------+-------+-------+------------------+-------+
          * |outlet|item|1_week1|1_week2|1_week3           |1_week4|2_week1|2_week2|2_week3           |2_week4|
          * +------+----+-------+-------+------------------+-------+-------+-------+------------------+-------+
          * |o1    |i1  |2.4    |3.0    |3.5999999999999996|4.8    |1.6    |2.0    |2.4000000000000004|3.2    |
          * +------+----+-------+-------+------------------+-------+-------+-------+------------------+-------+
          */
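
The `pivot("day")` call turns each distinct `day` value into a column prefix, which is why the output columns are named `1_week1`, `2_week1`, and so on. A rough plain-Python illustration of that reshaping (not Spark code; values copied from the sample tables above):

```python
# Sketch of what groupBy(...).pivot("day").agg(...) produces for the sample rows.
table1 = [("o1", "i1", 1, 0.6), ("o1", "i1", 2, 0.4)]   # outlet, item, day, ratio
table2 = {("o1", "i1"): {"week1": 4, "week2": 5, "week3": 6, "week4": 8}}

pivoted = {}
for outlet, item, day, ratio in table1:
    for week, total in table2[(outlet, item)].items():
        # pivot names each output column "<day value>_<aggregation alias>"
        pivoted[f"{day}_{week}"] = total * ratio
# pivoted["1_week1"] is 2.4 and pivoted["2_week4"] is 3.2, matching the table above
```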
    

    In Python

    from pyspark.sql import functions as F

    # use F.col(...) here: the joined frame is anonymous, so there is no `df` to reference
    table1.join(table2, ['outlet', 'item']) \
        .groupBy("outlet", "item") \
        .pivot("day") \
        .agg(
            F.first(F.col("week1") * F.col("ratio")).alias("week1"),
            F.first(F.col("week2") * F.col("ratio")).alias("week2"),
            F.first(F.col("week3") * F.col("ratio")).alias("week3"),
            F.first(F.col("week4") * F.col("ratio")).alias("week4")
        ).show(truncate=False)
    

    【Discussion】:

    • Is there any way to get this in pyspark? I tried, but had no success. It would really help if possible!
    • Try this - from pyspark.sql import functions as F table1.join(table2, ['outlet', 'item']) .groupBy("outlet", "item") .pivot("day") .agg( F.first(df.week1 * df.ratio).alias("week1"), F.first(df.week2 * df.ratio).alias("week2"), F.first(df.week3 * df.ratio).alias("week3"), F.first(df.week4 * df.ratio).alias("week4") ).show(truncate=False). Haven't executed it on pyspark yet, but it should work