[Title]: Spark/Scala: how to subtract the previous column's values from the current column's?
[Posted]: 2020-08-07 00:40:50
[Description]:

I have a dataframe that looks like this:

+--------------+-------+-------+-------+-------+-------+-------+-------+
|Country/Region| 3/7/20| 3/8/20| 3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
+--------------+-------+-------+-------+-------+-------+-------+-------+
|       Senegal|      0|      4|     10|     18|     27|     31|     35|
|       Tunisia|      1|      8|     15|     21|     37|     42|     59|
+--------------+-------+-------+-------+-------+-------+-------+-------+

There is a single row per country, but many columns, one per day. I want to iterate over the columns and subtract from each one the corresponding value in the previous column, so the resulting df should look like this:

+--------------+-------+-------+-------+-------+-------+-------+-------+
|Country/Region| 3/7/20| 3/8/20| 3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
+--------------+-------+-------+-------+-------+-------+-------+-------+
|       Senegal|      0|      4|      6|      8|      9|      4|      4|
|       Tunisia|      1|      7|      7|      6|     16|      5|     17|
+--------------+-------+-------+-------+-------+-------+-------+-------+
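At its core, the requested transformation is a pairwise difference of a cumulative series, applied per row. A minimal plain-Scala sketch of the arithmetic (outside Spark, using the Senegal row from above):

```scala
// Pairwise-difference a cumulative daily series. Prepending 0 makes the
// first day keep its own value (x - 0 = x).
val senegal = Seq(0, 4, 10, 18, 27, 31, 35)
val daily = (0 +: senegal).sliding(2).map { case Seq(prev, cur) => cur - prev }.toList
println(daily) // List(0, 4, 6, 8, 9, 4, 4)
```

The solution below does the same thing, but with Spark column expressions instead of integers.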

[Comments]:

Tags: scala apache-spark apache-spark-sql


[Solution 1]:

Perhaps this helps -

    import java.text.SimpleDateFormat
    import org.apache.spark.sql.functions.{col, lit}
    import org.apache.spark.sql.types.NumericType

    df2.show(false)
    df2.printSchema()
    /**
      * +--------------+------+------+------+-------+-------+-------+-------+
      * |Country/Region|3/7/20|3/8/20|3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
      * +--------------+------+------+------+-------+-------+-------+-------+
      * |Senegal       |0     |4     |10    |18     |27     |31     |35     |
      * |Tunisia       |1     |8     |15    |21     |37     |42     |59     |
      * +--------------+------+------+------+-------+-------+-------+-------+
      *
      * root
      * |-- Country/Region: string (nullable = true)
      * |-- 3/7/20: integer (nullable = true)
      * |-- 3/8/20: integer (nullable = true)
      * |-- 3/9/20: integer (nullable = true)
      * |-- 3/10/20: integer (nullable = true)
      * |-- 3/11/20: integer (nullable = true)
      * |-- 3/12/20: integer (nullable = true)
      * |-- 3/13/20: integer (nullable = true)
      */

    // Prepend a dummy all-zero column; "01/01/70" parses to the Unix epoch,
    // so it sorts before every real date and the earliest real column keeps
    // its value after the pairwise subtraction (x - 0 = x).
    val new_df = df2.withColumn("01/01/70", lit(0))

    val sdf = new SimpleDateFormat("MM/dd/yy")
    val diffs = new_df.schema
      .filter(_.dataType.isInstanceOf[NumericType])   // keep only the day columns
      .map(_.name)
      .map(c => (sdf.parse(c), c))                    // pair each header with its parsed date
      .sortBy(_._1)                                   // chronological, not lexicographic, order
      .map(_._2)
      .sliding(2, 1)                                  // consecutive (previous, current) pairs
      .map(seq => (col(seq.last) - col(seq.head)).as(seq.last))

    new_df.select(col("Country/Region") +: diffs.toSeq: _*)
      .show(false)
    
        /**
          * +--------------+------+------+------+-------+-------+-------+-------+
          * |Country/Region|3/7/20|3/8/20|3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
          * +--------------+------+------+------+-------+-------+-------+-------+
          * |Senegal       |0     |4     |6     |8      |9      |4      |4      |
          * |Tunisia       |1     |7     |7     |6      |16     |5      |17     |
          * +--------------+------+------+------+-------+-------+-------+-------+
          */
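One subtlety in the solution above: the headers must be ordered chronologically before sliding over them, and a plain string sort would get that wrong. A small plain-Scala sketch (no Spark) of why each header is parsed with `SimpleDateFormat` first:

```scala
import java.text.SimpleDateFormat

val headers = Seq("3/7/20", "3/8/20", "3/9/20", "3/10/20", "3/11/20", "3/12/20", "3/13/20")

// Lexicographic order is wrong: "3/10/20" < "3/7/20" as strings,
// so headers.sorted would start with 3/10/20, 3/11/20, ...

// Parsing each header as a date restores the true order; the dummy
// "01/01/70" header parses to the Unix epoch and therefore sorts first,
// which is what lets it pair with the earliest real column.
val sdf = new SimpleDateFormat("MM/dd/yy")
val chronological = ("01/01/70" +: headers).sortBy(sdf.parse)
println(chronological.head) // 01/01/70
println(chronological.last) // 3/13/20
```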
    

[Discussion]:

    • Thanks for the answer, but where does the variable `col` in `.map(seq => (col(seq.last) - col(seq.head)).as(seq.last))` come from? It isn't recognized when I try to run it.
    • Import it: `functions.col`
    • Thanks for your help
    • If it helped, feel free to accept + upvote..meta.stackexchange.com/a/5235/767994
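For reference, the full set of imports the snippet relies on (assuming Spark is on the classpath) is:

```scala
import java.text.SimpleDateFormat                  // parses the MM/dd/yy headers
import org.apache.spark.sql.functions.{col, lit}   // column expressions
import org.apache.spark.sql.types.NumericType      // used to filter the day columns
```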