【Question Title】: spark scala: Convert Array of Struct column to String column
【Posted】: 2017-11-03 17:57:53
【Question Description】:

I have a column of array type, inferred from a JSON file. I want to convert the array to a string, so that I can keep this array column as-is in Hive and export it as a single column to an RDBMS.

temp.json

{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":
{"value":"296160"},"sku_id":
{"value":"312002"}}],"user_id":"6666","zip_code":"666"}}

Processing:

scala> val temp = spark.read.json("s3://check/1/temp1.json")
temp: org.apache.spark.sql.DataFrame = [properties: struct<items:
array<struct<invoicid:struct<value:string>,job_id:struct<value:string>,sku_id:struct<value:string>>>, user_id: string ... 1 more field>]

scala> temp.printSchema
root
 |-- properties: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- invoicid: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- job_id: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- sku_id: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |-- user_id: string (nullable = true)
 |    |-- zip_code: string (nullable = true)


scala> temp.select("properties").show
+--------------------+
|          properties|
+--------------------+
|[WrappedArray([[9...|
+--------------------+


scala> temp.select("properties.items").show
+--------------------+
|               items|
+--------------------+
|[[[923659],[29616...|
+--------------------+


scala> temp.createOrReplaceTempView("tempTable")

scala> spark.sql("select properties.items  from tempTable").show
+--------------------+
|               items|
+--------------------+
|[[[923659],[29616...|
+--------------------+

How can I get a result like this:

+-----------------------------------------------------------------------------------------+
|               items                                                                     |
+-----------------------------------------------------------------------------------------+
[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}] |
+-----------------------------------------------------------------------------------------+

That is, getting the array element values without any changes.

【Question Comments】:

  • [{"invoiceid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002" }}]

Tags: arrays json scala apache-spark


【Solution 1】:

to_json is the function you are looking for.

import org.apache.spark.sql.functions.{get_json_object, to_json}
// $"..." and sc are available out of the box in spark-shell; in a compiled
// application you would also need: import spark.implicits._

val df = spark.read.json(sc.parallelize(Seq("""
  {"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":
  {"value":"296160"},"sku_id":
  {"value":"312002"}}],"user_id":"6666","zip_code":"666"}}""")))


df
  .select(get_json_object(to_json($"properties"), "$.items").alias("items"))
  .show(false)
+-----------------------------------------------------------------------------------------+
|items                                                                                    |
+-----------------------------------------------------------------------------------------+
|[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}]|
+-----------------------------------------------------------------------------------------+
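As a side note, if the cluster runs Spark 2.2 or later (an assumption; to the best of my knowledge that release added ArrayType support to to_json), the array column can be serialized directly, without the get_json_object round trip. A minimal sketch against the same df:

import org.apache.spark.sql.functions.to_json

// Serialize the array<struct<...>> column itself to a JSON-array string;
// relies on to_json accepting ArrayType (Spark 2.2+, as far as I know).
df.select(to_json($"properties.items").alias("items")).show(false)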

【Comments】:

  • How can I extract all the columns attached to the root struct? For example, if "properties" were not there, I would want select(get_json_object(to_json(($".*")),"$.value")) to work, but it does not.
  • to_json(struct(df.columns map col: _*))
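To make the one-liner from the last comment concrete, here is a hedged sketch of how it might be applied to the same df (only standard org.apache.spark.sql.functions are assumed; the alias name "json" is illustrative):

import org.apache.spark.sql.functions.{col, struct, to_json}

// Wrap every top-level column of df into a single struct, then serialize
// that struct to one JSON string column; useful when there is no single
// root column such as "properties" to point to_json at.
val asJson = df.select(to_json(struct(df.columns.map(col): _*)).alias("json"))
asJson.show(false)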