【Question title】: Spark SQL string to timestamp missing milliseconds
【Posted】: 2019-09-13 16:32:02
【Question】:

Why does the following:

import org.apache.spark.sql.functions._
import spark.implicits._
  val content = Seq(("2019", "09", "11","17","16","54","762000000")).toDF("year", "month", "day", "hour", "minute", "second", "nano")
  content.printSchema
  content.show
  content.withColumn("event_time_utc", to_timestamp(concat('year, 'month, 'day, 'hour, 'minute, 'second), "yyyyMMddHHmmss"))
    .withColumn("event_time_utc_millis", to_timestamp(concat('year, 'month, 'day, 'hour, 'minute, 'second, substring('nano, 0, 3)), "yyyyMMddHHmmssSSS"))
    .select('year, 'month, 'day, 'hour, 'minute, 'second, 'nano,substring('nano, 0, 3), 'event_time_utc, 'event_time_utc_millis)
    .show

lose the milliseconds?

+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
|year|month|day|hour|minute|second|     nano|substring(nano, 0, 3)|     event_time_utc|event_time_utc_millis|
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
|2019|   09| 11|  17|    16|    54|762000000|                  762|2019-09-11 17:16:54|  2019-09-11 17:16:54|
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+

The format string is yyyyMMddHHmmssSSS which, if I am not mistaken, should pick up the milliseconds through SSS.

【Comments】:

  • What is your Spark version?
  • My Spark version is 2.2.2.
  • OK, yes, in spark

Tags: apache-spark apache-spark-sql timestamp milliseconds format-string


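【Solution 1】:

I ran into a similar problem. The official documentation for the older Spark versions describes the conversion as follows:

Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return null if fail.

This means the conversion only works at second granularity, so the millisecond part is dropped.

Spark >= 2.4 can handle SSS as well.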
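
To make the effect of a seconds-based conversion concrete, here is a minimal plain-Scala sketch (no Spark involved). It is not Spark's actual implementation; `SecondsTruncationDemo`, `parseMillis` and `viaUnixSeconds` are hypothetical names used only for illustration. It parses the sample value with full precision and then pushes it through whole Unix seconds, which is where the fraction disappears.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

object SecondsTruncationDemo {
  // Parse the string with millisecond precision and return epoch milliseconds.
  def parseMillis(input: String, pattern: String): Long =
    new SimpleDateFormat(pattern).parse(input).getTime

  // Imitate a round trip through a Unix timestamp in whole seconds:
  // epoch millis -> whole seconds -> back to millis.
  def viaUnixSeconds(epochMillis: Long): Timestamp =
    new Timestamp(Math.floorDiv(epochMillis, 1000L) * 1000L)

  def main(args: Array[String]): Unit = {
    val millis = parseMillis("20190911171654762", "yyyyMMddHHmmssSSS")
    println(new Timestamp(millis))   // still carries the .762 fraction
    println(viaUnixSeconds(millis))  // the fraction has been dropped
  }
}
```

Any conversion that passes through whole seconds behaves like `viaUnixSeconds`, no matter what the format string contains.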

Solution: the following UDF helps handle this case:

import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import scala.util.{Try, Success, Failure}

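// Parses `input` with the given SimpleDateFormat pattern, keeping the milliseconds.
// Returns None for an empty string or when parsing fails.
val getTimestampWithMilis: ((String , String) => Option[Timestamp]) = (input, frmt) => input match {
  case "" => None
  case _ => {
    // SimpleDateFormat is not thread-safe, so a new instance is created per call
    val format = new SimpleDateFormat(frmt)
    Try(new Timestamp(format.parse(input).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None
    }
  }
}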

val getTimestampWithMilisUDF = udf(getTimestampWithMilis)

With your example:

val content = Seq(("2019", "09", "11","17","16","54","762000000")).toDF("year", "month", "day", "hour", "minute", "second", "nano")
val df = content.withColumn("event_time_utc", concat('year, 'month, 'day, 'hour, 'minute, 'second, substring('nano, 0, 3)))
df.show
+----+-----+---+----+------+------+---------+-----------------+
|year|month|day|hour|minute|second|     nano|   event_time_utc|
+----+-----+---+----+------+------+---------+-----------------+
|2019|   09| 11|  17|    16|    54|762000000|20190911171654762|
+----+-----+---+----+------+------+---------+-----------------+

df.withColumn("event_time_utc_millis", getTimestampWithMilisUDF($"event_time_utc", lit("yyyyMMddHHmmssSSS"))).show(1, false)
+----+-----+---+----+------+------+---------+-----------------+-----------------------+
|year|month|day|hour|minute|second|nano     |event_time_utc   |event_time_utc_millis  |
+----+-----+---+----+------+------+---------+-----------------+-----------------------+
|2019|09   |11 |17  |16    |54    |762000000|20190911171654762|2019-09-11 17:16:54.762|
+----+-----+---+----+------+------+---------+-----------------+-----------------------+

root
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- second: string (nullable = true)
 |-- nano: string (nullable = true)
 |-- event_time_utc: string (nullable = true)
 |-- event_time_utc_millis: timestamp (nullable = true)
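
One detail worth knowing about the UDF above: `SimpleDateFormat` is lenient by default, so an out-of-range field such as month 13 is rolled over into the next year instead of failing, and the `Failure` branch is never reached for that input. If such rows should come back as null, a stricter variant could look like the sketch below. `StrictTimestampParser` and `parseStrict` are hypothetical names, and this is only one possible way to write it.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import scala.util.Try

object StrictTimestampParser {
  // Hypothetical stricter variant of the UDF body:
  // null, empty, or out-of-range input yields None instead of a rolled-over date.
  def parseStrict(input: String, pattern: String): Option[Timestamp] =
    Option(input).filter(_.nonEmpty).flatMap { s =>
      val format = new SimpleDateFormat(pattern) // not thread-safe, so one per call
      format.setLenient(false)                   // reject values such as month 13
      Try(new Timestamp(format.parse(s).getTime)).toOption
    }

  def main(args: Array[String]): Unit = {
    val pattern = "yyyyMMddHHmmssSSS"
    println(parseStrict("20190911171654762", pattern)) // Some(...) with the .762 fraction
    println(parseStrict("20191311171654762", pattern)) // None, month 13 is rejected
  }
}
```

It should be possible to wrap it the same way as the original function, for example with `udf(StrictTimestampParser.parseStrict _)`.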

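【Comments】:

  • OK, that is clear ;) But it means more maintenance and it is slower. I was hoping for a better solution, but apparently that is not possible with my Spark version.
  • Yes, that is the benefit of open source :) I made the function generic. It can be shortened as needed.

【Solution 2】: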

Try concatenating the parts into this standard layout: "yyyy-MM-dd HH:mm:ss.ssss" (trailing zeros are ignored, e.g. a nano/millisecond value of "762000000" becomes "762")

yourDataframe
.withColumn("dateTime_complete", 
concat_ws(" ", concat_ws("-", col("year"), col("month"), col("day")),
        concat_ws(":", col("hour"), col("minute"), concat_ws(".", col("second"), col("nano")))))
.withColumn("your_new_column", to_utc_timestamp(col("dateTime_complete"), "yyyy-MM-dd HH:mm:ss.sss"))
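
The claim that trailing zeros in the fraction do not matter can be checked outside Spark with the JDBC timestamp layout. The sketch below is plain Scala; `FractionalSecondsDemo` and `build` are hypothetical names. It joins the parts into a "yyyy-MM-dd HH:mm:ss.fffffffff" string and parses it with `java.sql.Timestamp.valueOf`, which accepts up to nine fractional digits.

```scala
import java.sql.Timestamp

object FractionalSecondsDemo {
  // Join the separate string parts into "yyyy-MM-dd HH:mm:ss.fffffffff".
  def build(year: String, month: String, day: String,
            hour: String, minute: String, second: String, nano: String): String =
    s"$year-$month-$day $hour:$minute:$second.$nano"

  def main(args: Array[String]): Unit = {
    val text = build("2019", "09", "11", "17", "16", "54", "762000000")
    val ts = Timestamp.valueOf(text) // parses the nine-digit fraction
    println(ts)                      // prints 2019-09-11 17:16:54.762
  }
}
```

As far as I know, a string in this layout can also be turned into a timestamp in Spark with a plain `col("dateTime_complete").cast("timestamp")`, which avoids the format argument altogether. Treat that as an assumption and verify it on your Spark version.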

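【Comments】:

  • Why would .withColumn("your_new_column_ts", to_timestamp(col("dateTime_complete"), "yyyy-MM-dd HH:mm:ss.sss")) be null?
  • Since the input is already UTC, the extra conversion to UTC does not change the output, but it seems unnecessary. The to_timestamp variant returns NULL instead.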