【问题标题】:Add new StructField to Array of StructFields in scala将新的 StructField 添加到 Scala 中的 StructField 数组
【发布时间】:2020-12-10 09:39:22
【问题描述】:

我有一个定义为 json 数据的架构

val gpsSchema: StructType = 
  StructType(Array(
    StructField("Name",StringType,true),
    StructField("GPS", ArrayType(
      StructType(Array(
          StructField("TimeStamp",DoubleType,true),
          StructField("Longitude", DoubleType, true),
          StructField("Latitude",DoubleType,true)
          )),true),true)))

数据

{"Name":"John","GPS":[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052}, 
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038}, 
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022}, 
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]}

如何将新的 StructField "ID" (uid) 添加到 GPS 数组中,以便

之前

[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052}, 
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038}, 
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022}, 
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]

之后

[{"ID": 123,"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052}, 
{"ID": 123, "TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038}, 
{"ID": 123,"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022}, 
{"ID": 123,"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]

一种方法是展平嵌套字段,添加新列“ID”,使用 struct("ID","TimeStamp","Longitude","Latitude") 并执行 collect_list 如下:-

Dataframe
.withColumn( "ID", uuid())
.withColumn("GPS", explode($"GPS"))
.select($"ID", $"Name", $"GPS.*")
.select($"Name" ,struct("ID","TimeStamp","Longitude","Latitude").alias("field"))
.groupBy("Name").agg(collect_list($"field"))

如果数组中有大量元素可能导致火花驱动器崩溃,这将是一项昂贵的操作

还有其他方法可以在现有架构的 GPS 数组中添加“ID”字段吗?

【问题讨论】:

  • 火花版??
  • Spark 2.4.5、Scala 2.11

标签: json scala apache-spark struct schema


【解决方案1】:

如果您不想使用explodegroupBycollect_list,请尝试以下代码。

scala> df.printSchema
root
 |-- GPS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Latitude: double (nullable = true)
 |    |    |-- Longitude: double (nullable = true)
 |    |    |-- TimeStamp: double (nullable = true)
 |-- Name: string (nullable = true)


scala> :paste
// Entering paste mode (ctrl-D to finish)

val addCol = udf((id:String,json:String) => {
    import org.json4s._
    import org.json4s.jackson.JsonMethods._
    implicit val formats = DefaultFormats
    import org.json4s.JsonDSL._
    compact(parse(json).extract[List[Map[String,String]]].map(m => m ++ Map("id" -> id)))
})


// Exiting paste mode, now interpreting.

addCol: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3755/1542889653@a4d1d2c,StringType,List(Some(class[value[0]: string]), Some(class[value[0]: string])),None,true,true)

scala> df.withColumn("GPS_New",addCol(uuid,to_json($"GPS"))).show(false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|GPS                                                                                                                                                                                     |Name|GPS_New                                                                                                                                                                                                                                                                                                                                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[40.787052, -76.463684, 1.605449171259277E9], [40.787038, -76.464046, 1.605449175743052E9], [40.787022, -76.464465, 1.605449180932659E9], [40.787054, -76.464977, 1.605449187288478E9]]|John|[{"Latitude":"40.787052","Longitude":"-76.463684","TimeStamp":"1.605449171259277E9","id":"123"},{"Latitude":"40.787038","Longitude":"-76.464046","TimeStamp":"1.605449175743052E9","id":"123"},{"Latitude":"40.787022","Longitude":"-76.464465","TimeStamp":"1.605449180932659E9","id":"123"},{"Latitude":"40.787054","Longitude":"-76.464977","TimeStamp":"1.605449187288478E9","id":"123"}]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

【讨论】:

  • 嗨,Srinivas,函数中的“GPS_New”列也将值作为字符串,如双引号中所示。如何更改如下“id”:“123”,“时间戳”:1586966278.699324,“经度”:-74.05301617127735,“纬度”:41.94429181915331
  • 使用addColudf,里面我正在转换成json。
  • 在您的“GPS_New”示例中,addcol 函数将双精度类型转换为字符串 {"Latitude":"40.787054","Longitude":"-76.464977","TimeStamp":"1.605449187288478E9 ","id":"123"}。有没有办法将时间戳、纬度、经度保持为双精度类型,id 仅作为字符串
猜你喜欢
  • 2019-04-06
  • 1970-01-01
  • 2020-11-04
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2019-12-19
  • 2021-12-11
  • 2019-06-22
相关资源
最近更新 更多