【Posted】: 2018-08-16 07:37:54
【Problem description】:
I am trying to translate the following pyspark code into Scala. As you know, dataframes in Scala are immutable, and that is constraining my conversion of the code below:
Pyspark code:

from pyspark.sql import functions as fn

time_frame = ["3m", "6m", "9m", "12m", "18m", "27m", "60m", "60m_ab"]
variable_name = ["var1", "var2", "var3", ..., "var30"]
train_df = sqlContext.sql("select * from someTable")
for var in variable_name:
    for tf in range(1, len(time_frame)):
        train_df = train_df.withColumn(
            time_frame[tf] + '_' + var,
            fn.col(time_frame[tf] + '_' + var) + fn.col(time_frame[tf - 1] + '_' + var)
        )
So, as you can see above, the table's existing columns are used to derive further columns (each time-frame column is accumulated with the previous one). However, the immutable nature of dataframes in Spark/Scala is getting in my way. Could you help me work around this?
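One point worth noting: immutability is not actually a blocker here, because `withColumn` returns a new DataFrame, and the pyspark loop simply reassigns `train_df` each iteration. In Scala the idiomatic way to thread an immutable value through a sequence of updates is `foldLeft`. The sketch below demonstrates that pattern without Spark, using a plain `Map[String, Int]` as a stand-in for the DataFrame (the object name `FoldDemo`, the toy column subset, and the sample values are all hypothetical, for illustration only):

```scala
object FoldDemo {
  // Toy subsets of the question's time frames and variables.
  val timeFrame = List("3m", "6m", "9m")
  val variableNames = List("var1", "var2")

  // "Update a column" by adding the previous time frame's column into the
  // current one, returning a NEW map -- just as withColumn returns a new
  // DataFrame instead of mutating the old one.
  def addPrev(df: Map[String, Int], cur: String, prev: String): Map[String, Int] =
    df.updated(cur, df(cur) + df(prev))

  def run(df0: Map[String, Int]): Map[String, Int] = {
    // All (current, previous) column-name pairs, mirroring the nested loops.
    val pairs = for {
      v <- variableNames
      i <- 1 until timeFrame.length
    } yield (s"${timeFrame(i)}_$v", s"${timeFrame(i - 1)}_$v")

    // foldLeft threads the immutable "dataframe" through every update,
    // replacing the pyspark pattern train_df = train_df.withColumn(...).
    pairs.foldLeft(df0) { case (df, (cur, prev)) => addPrev(df, cur, prev) }
  }
}
```

With real Spark the same shape should carry over directly, something like `pairs.foldLeft(trainDf) { case (df, (cur, prev)) => df.withColumn(cur, col(cur) + col(prev)) }`, assuming `col` is imported from `org.apache.spark.sql.functions`.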
【Question discussion】:
Tags: scala apache-spark pyspark apache-spark-sql pyspark-sql