【Question title】: How to work around the immutable data frames in Spark/Scala?
【Posted】: 2018-08-16 07:37:54
【Question】:

I am trying to convert the following PySpark code to Scala. As you know, DataFrames in Scala are immutable, and that is keeping me from converting the following code:

PySpark code:

 from pyspark.sql import functions as fn

 time_frame = ["3m","6m","9m","12m","18m","27m","60m","60m_ab"]
 variable_name = ["var1", "var2", "var3"....., "var30"]
 train_df = sqlContext.sql("select * from someTable")

 for var in variable_name:
     for tf in range(1,len(time_frame)):
         train_df=train_df.withColumn(str(time_frame[tf]+'_'+var), fn.col(str(time_frame[tf]+'_'+var))+fn.col(str(time_frame[tf-1]+'_'+var)))

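So, as you can see above, the table has several columns that are used to recompute other columns. However, the immutability of DataFrames in Spark/Scala gets in the way. Could you help me with a workaround?

【Comments】:

    Tags: scala apache-spark pyspark apache-spark-sql pyspark-sql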


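    【Solution 1】:

    Here is one approach: first use a for-comprehension to generate a list of tuples, each holding a pair of column names, then use foldLeft to traverse that list and iteratively transform trainDF via withColumn: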

    import org.apache.spark.sql.functions._
    
    val timeframes: Seq[String] = ???
    val variableNames: Seq[String] = ???
    
    val newCols = for {
      vn <- variableNames
      tf <- 1 until timeframes.size
    } yield (timeframes(tf) + "_" + vn, timeframes(tf - 1) + "_" + vn)
    
    val trainDF = spark.sql("""select * from some_table""")
    
    val resultDF = newCols.foldLeft(trainDF)( (accDF, cs) =>
      accDF.withColumn(cs._1, col(cs._1) + col(cs._2))
    )
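
To make the foldLeft step more concrete, here is a minimal Spark-free sketch in which an immutable Map stands in for a single DataFrame row. The names sampleRow, steps, and folded are made up for illustration and are not part of the answer. It shows that each step builds a new value from the previous accumulator, so a later column is summed with the already updated earlier column.

```scala
// Spark-free sketch: an immutable Map plays the role of one DataFrame row.
// sampleRow, steps and folded are hypothetical names used only here.
val tfs = Seq("3m", "6m", "9m")
val sampleRow = Map("3m_var1" -> 10, "6m_var1" -> 11, "9m_var1" -> 12)

// Pairs of (column to overwrite, preceding column), like newCols above.
val steps = (1 until tfs.size).map { tf =>
  (tfs(tf) + "_var1", tfs(tf - 1) + "_var1")
}

// Each step returns a new Map; the accumulator is never modified in place.
val folded = steps.foldLeft(sampleRow) { case (acc, (cur, prev)) =>
  acc.updated(cur, acc(cur) + acc(prev))
}

println(folded("6m_var1"))    // 21 (11 + 10)
println(folded("9m_var1"))    // 33 (12 + the updated 6m value, 21)
println(sampleRow("6m_var1")) // 11, the original Map is unchanged
```

The same threading happens with DataFrames: withColumn returns a new DataFrame, and foldLeft passes it on as the accumulator for the next pair.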
    

    To test the foldLeft code above, just provide sample input and register it as the table some_table:

    import spark.implicits._  // needed for toDF; already in scope in spark-shell

    val timeframes = Seq("3m", "6m", "9m")
    val variableNames = Seq("var1", "var2")
    
    val df = Seq(
      (1, 10, 11, 12, 13, 14, 15),
      (2, 20, 21, 22, 23, 24, 25),
      (3, 30, 31, 32, 33, 34, 35)
    ).toDF("id", "3m_var1", "6m_var1", "9m_var1", "3m_var2", "6m_var2", "9m_var2")
    
    df.createOrReplaceTempView("some_table")
    

    resultDF should look like this:

    resultDF.show
    // +---+-------+-------+-------+-------+-------+-------+
    // | id|3m_var1|6m_var1|9m_var1|3m_var2|6m_var2|9m_var2|
    // +---+-------+-------+-------+-------+-------+-------+
    // |  1|     10|     21|     33|     13|     27|     42|
    // |  2|     20|     41|     63|     23|     47|     72|
    // |  3|     30|     61|     93|     33|     67|    102|
    // +---+-------+-------+-------+-------+-------+-------+
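
For comparison, here is a sketch of a more literal translation of the question's loop. PySpark DataFrames are immutable as well: the Python code only rebinds the name train_df to the new DataFrame returned by each withColumn call, and a Scala var can do the same. The helper name addRunningSums is hypothetical and not part of the answer.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper that mirrors the question's PySpark loop one-to-one.
def addRunningSums(
    input: DataFrame,
    timeframes: Seq[String],
    variableNames: Seq[String]
): DataFrame = {
  var df = input // mutable reference to immutable DataFrames
  for (vn <- variableNames; tf <- 1 until timeframes.size) {
    val cur = timeframes(tf) + "_" + vn
    val prev = timeframes(tf - 1) + "_" + vn
    // withColumn returns a new DataFrame; only the reference df is rebound.
    df = df.withColumn(cur, col(cur) + col(prev))
  }
  df
}
```

Calling addRunningSums(trainDF, timeframes, variableNames) applies the same sequence of withColumn calls as the foldLeft version, so on the sample data it should produce the same table. The foldLeft form simply avoids the var.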
    

    【Comments】:
