spark add_month doesn't work as expected [duplicate]

【Question title】: spark add_month doesn't work as expected [duplicate]
【Posted】: 2017-11-03 15:28:36
【Question description】:

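In a DataFrame, I am generating a column based on column A, which is a DateType column in "yyyy-MM-dd" format. Column A is produced by a UDF (the UDF returns a random date within the last 24 months).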

From that generated date, I try to compute column B, which is column A minus 6 months. E.g. 2017-06-01 in A should be 2016-12-01 in B. To do this I use the function add_months(columnname, -6).

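When I do this with another column (one not generated by the UDF), I get the right result. But when I do it on the generated column, I get random values that are completely wrong.

I checked the schema, and the column is of DateType.

Here is my code: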

val test = df.withColumn("A", to_date(callUDF("randomUDF")))
val test2 = test.select(col("*"), add_months(col("A"), -6).as("B"))

My UDF code:

import java.text.SimpleDateFormat
import java.util.Calendar

sqlContext.udf.register("randomUDF", () => {

// prepare the date format
val formatter = new SimpleDateFormat("yyyy-MM-dd")

// get today's date as reference
val today = Calendar.getInstance()
val now = today.getTime()

// set "from" to 24 months before now
val from = Calendar.getInstance()
from.setTime(now)
from.add(Calendar.MONTH, -24)

// convert both dates to Long (epoch milliseconds)
val valuefrom = from.getTimeInMillis()
val valueto = today.getTimeInMillis()

// generate a random value between from and to
val value3 = (valuefrom + Math.random()*(valueto - valuefrom))

// set the generated value on a Calendar and format the date
val calendar3 = Calendar.getInstance()
calendar3.setTimeInMillis(value3.toLong)
formatter.format(calendar3.getTime())
})

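The UDF works as expected, but I think something is going wrong here. I tried the add_months function on another (non-generated) column and it worked fine.

Example of the results I get with this code: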

A            |      B
2017-10-20   |   2016-02-27
2016-05-06   |   2015-05-25
2016-01-09   |   2016-03-14
2016-01-04   |   2017-04-26
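
For comparison, the B values one would expect for the A values above can be worked out with plain calendar arithmetic. The sketch below is not Spark's add_months implementation; it is a standalone helper (the names ExpectedB and minusMonths are made up for illustration) that uses the same java.util.Calendar API as the UDF.

```scala
import java.text.SimpleDateFormat
import java.util.Calendar

object ExpectedB {
  // Subtract `months` calendar months from a "yyyy-MM-dd" date string.
  def minusMonths(date: String, months: Int): String = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    val cal = Calendar.getInstance()
    cal.setTime(fmt.parse(date))
    cal.add(Calendar.MONTH, -months)
    fmt.format(cal.getTime)
  }
}

// ExpectedB.minusMonths("2017-10-20", 6) gives "2017-04-20"
// ExpectedB.minusMonths("2016-05-06", 6) gives "2015-11-06"
```

None of the B values in the table match this arithmetic, which suggests B was not derived from the A value shown in the same row.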

Using Spark version 1.5.1 with Scala 2.10.4.

【Tags】: scala date apache-spark apache-spark-sql user-defined-functions

【Solution 1】:

In your code, the creation of the test2 DataFrame

val test2 = test.select(col("*"), add_months(col("A"), -6).as("B"))

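is treated by Spark as the following, because DataFrame transformations are lazy and test is only a query plan, not stored data: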

val test2 = df.withColumn("A", to_date(callUDF("randomUDF"))).select(col("*"), add_months(to_date(callUDF("randomUDF")), -6).as("B"))

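So you can see that the UDF is called twice. df.withColumn("A", to_date(callUDF("randomUDF"))) generates the date in column A, while add_months(to_date(callUDF("randomUDF")), -6).as("B") calls the UDF again, generates a new date, subtracts 6 months from it, and shows that date in column B.

That is why you get random dates.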
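
The double evaluation can be illustrated without Spark. The following is a minimal plain-Scala sketch, not Spark code: CountingUdf, inlined and materialised are hypothetical names, and dates are simplified to integer day offsets (180 days standing in for 6 months). It only shows that referencing a non-deterministic function twice yields two independent draws, while storing the first result does not.

```scala
import scala.util.Random

// Stand-in for the random UDF: every call draws a new value and is counted.
final class CountingUdf(seed: Long) {
  private val rng = new Random(seed)
  var calls = 0
  def apply(): Int = { calls += 1; rng.nextInt(730) } // day offset within ~24 months
}

object DoubleEvaluation {
  // Mimics the inlined plan: A and B each invoke the UDF separately.
  def inlined(udf: CountingUdf): (Int, Int) = {
    val a = udf()
    val b = udf() - 180 // B is based on a second, unrelated draw
    (a, b)
  }

  // Mimics a materialised column: the UDF runs once and B is derived from A.
  def materialised(udf: CountingUdf): (Int, Int) = {
    val a = udf()
    (a, a - 180)
  }
}
```

With inlined, the UDF runs twice per row and B has no fixed relation to A; with materialised, it runs once and B is always A minus 180.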

The solution is to use persist or cache on the test DataFrame:

val test = df.withColumn("A", callUDF("randomUDF")).cache()
val test2 = test.as("table").withColumn("B", add_months($"table.A", -6))
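
Note that cached partitions can be evicted and recomputed, in which case the UDF would run again. An alternative that does not depend on caching is to make the UDF deterministic per row. The sketch below is one possible way to do that, under assumptions that are not in the original post: SeededDates and randomDate are hypothetical names, and the DataFrame is assumed to have a stable per-row key (called id here) that can serve as the seed.

```scala
import java.text.SimpleDateFormat
import java.util.{Calendar, Date}
import scala.util.Random

object SeededDates {
  // Deterministic replacement for the random UDF body: the same
  // (seed, todayMillis) pair always yields the same "yyyy-MM-dd" string.
  def randomDate(seed: Long, todayMillis: Long): String = {
    // lower bound: 24 months before the reference instant
    val from = Calendar.getInstance()
    from.setTimeInMillis(todayMillis)
    from.add(Calendar.MONTH, -24)
    val fromMillis = from.getTimeInMillis

    // the seeded generator makes the draw reproducible for a given row key
    val rng = new Random(seed)
    val picked = fromMillis + (rng.nextDouble() * (todayMillis - fromMillis)).toLong
    new SimpleDateFormat("yyyy-MM-dd").format(new Date(picked))
  }
}
```

In Spark this might be registered as sqlContext.udf.register("seededRandomUDF", (id: Long) => SeededDates.randomDate(id, nowMillis)) and used as callUDF("seededRandomUDF", col("id")), where nowMillis is a value captured once on the driver. Re-evaluating the expression for column B would then reproduce the same date as column A.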
