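【Question title】: Spark DataFrame from a JSON string with a nested key
【Posted】: 2021-07-18 22:11:58
【Question】:

I have several columns to extract from a JSON string. However, one field has nested values, and I am not sure how to handle it.

I need to explode it into multiple rows to get the values of field name, Value1 and Value2.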

import spark.implicits._
import org.apache.spark.sql.functions.{col, get_json_object}

val df = Seq(
  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
  ("2", """{"p": "bar", "q": 3.0}""", "some_other_field_2"),
  ("3",
    """{"nestedKey":[ {"field name":"name1","Value1":false,"Value2":true},
      |                 {"field name":"name2","Value1":"100","Value2":"200"}
      |]}""".stripMargin, "some_other_field_3")

).toDF("id","json","other")

df.show(truncate = false)
val df1= df.withColumn("id1",col("id"))
  .withColumn("other1",col("other"))
  .withColumn("k",get_json_object(col("json"),"$.k"))
  .withColumn("v",get_json_object(col("json"),"$.v"))
  .withColumn("p",get_json_object(col("json"),"$.p"))
  .withColumn("q",get_json_object(col("json"),"$.q"))
  .withColumn("nestedKey",get_json_object(col("json"),"$.nestedKey"))
  .select("id1","other1","k","v","p","q","nestedKey")
df1.show(truncate = false)

【Discussion】:

    Tags: json scala dataframe apache-spark apache-spark-sql


    【Solution 1】:

    You can use from_json to parse nestedKey and then explode it:

    import org.apache.spark.sql.functions.expr

    val df2 = df1.withColumn(
        "nestedKey", 
        expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
    ).select("*", "nestedKey.*").drop("nestedKey")
    
    df2.show
    +---+------------------+----+----+----+----+----------+------+------+
    |id1|            other1|   k|   v|   p|   q|field name|Value1|Value2|
    +---+------------------+----+----+----+----+----------+------+------+
    |  1|some_other_field_1| foo| 1.0|null|null|      null|  null|  null|
    |  2|some_other_field_2|null|null| bar| 3.0|      null|  null|  null|
    |  3|some_other_field_3|null|null|null|null|     name1| false|  true|
    |  3|some_other_field_3|null|null|null|null|     name2|   100|   200|
    +---+------------------+----+----+----+----+----------+------+------+
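
The same parse-and-explode step can also be written with the typed DataFrame API instead of a SQL expression string. The following is a minimal sketch, assuming it is pasted into spark-shell or run as a script; the session settings, the reduced two-row input, and the names `nestedSchema` and `exploded` are illustrative assumptions rather than part of the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode_outer, from_json, get_json_object}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Assumption: a local session for illustration; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("nested-key-typed").getOrCreate()
import spark.implicits._

// Reduced version of the question's input: one flat row, one row with nestedKey.
val df = Seq(
  ("1", """{"k": "foo", "v": 1.0}"""),
  ("3", """{"nestedKey":[{"field name":"name1","Value1":false,"Value2":true},{"field name":"name2","Value1":"100","Value2":"200"}]}""")
).toDF("id", "json")

// Same shape as the DDL string in the answer: array<struct<`field name`, Value1, Value2>>,
// with every field read as a string because the values mix booleans and strings.
val nestedSchema = ArrayType(StructType(Seq(
  StructField("field name", StringType),
  StructField("Value1", StringType),
  StructField("Value2", StringType)
)))

val exploded = df
  // Extract the nestedKey array as a JSON string, parse it, and emit one row per element.
  // explode_outer keeps rows whose array is null instead of dropping them.
  .withColumn("nestedKey", explode_outer(from_json(get_json_object(col("json"), "$.nestedKey"), nestedSchema)))
  // Flatten the struct into top-level columns.
  .select(col("id"), col("nestedKey.*"))

exploded.show(truncate = false)
```

Building the schema from `StructType` avoids the backtick quoting that the DDL string needs for the `field name` key, at the cost of a few more lines.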
    

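    【Discussion】:

    • Thank you very much. Is it possible to alias these nested keys (such as field name) to something like "my field"?
    • Yes, you can rename the columns with .withColumnRenamed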
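
A minimal sketch of that renaming step, assuming a small stand-in for df2 that contains only the three exploded columns. The target names `my_field`, `value_one` and `value_two` are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()
import spark.implicits._

// Stand-in for df2 from the answer, reduced to the columns produced by the explode.
val df2 = Seq(
  ("name1", "false", "true"),
  ("name2", "100", "200")
).toDF("field name", "Value1", "Value2")

// withColumnRenamed takes the existing name and the new name; it is a no-op if the
// existing name is not present, so a typo fails silently rather than with an error.
val renamed = df2
  .withColumnRenamed("field name", "my_field")
  .withColumnRenamed("Value1", "value_one")
  .withColumnRenamed("Value2", "value_two")

renamed.printSchema()
```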
    【Solution 2】:

    I did it in a single DataFrame:

    val df1 = df.withColumn("id1",col("id"))
      .withColumn("other1",col("other"))
      .withColumn("k",get_json_object(col("json"),"$.k"))
      .withColumn("v",get_json_object(col("json"),"$.v"))
      .withColumn("p",get_json_object(col("json"),"$.p"))
      .withColumn("q",get_json_object(col("json"),"$.q"))
      .withColumn("nestedKey",get_json_object(col("json"),"$.nestedKey"))
      .withColumn(
        "nestedKey",
        expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
      )
      .withColumn("fieldname",col("nestedKey.field name"))
      .withColumn("valueone",col("nestedKey.Value1"))
      .withColumn("valuetwo",col("nestedKey.Value2"))
      .select("id1","other1","k","v","p","q","fieldname","valueone","valuetwo")
    
    
    I am still working on making it more elegant.
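
One possible way to shorten the chain of withColumn calls is to build the column list once and pass it to a single select, applying the aliases there. This is only a sketch under the same input as the question; the helper names `topLevelKeys`, `baseCols`, `nested` and `result` are hypothetical, and this is not necessarily the form the answer's author had in mind.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, get_json_object}

// Assumption: a local session for illustration; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("nested-key-select").getOrCreate()
import spark.implicits._

val df = Seq(
  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
  ("2", """{"p": "bar", "q": 3.0}""", "some_other_field_2"),
  ("3", """{"nestedKey":[{"field name":"name1","Value1":false,"Value2":true},{"field name":"name2","Value1":"100","Value2":"200"}]}""", "some_other_field_3")
).toDF("id", "json", "other")

// One get_json_object call per flat key, generated from a list instead of written by hand.
val topLevelKeys = Seq("k", "v", "p", "q")
val topLevelCols = topLevelKeys.map(key => get_json_object(col("json"), "$." + key).as(key))

val baseCols = Seq(col("id").as("id1"), col("other").as("other1"))

// Same SQL expression as in the answers, applied directly to the json column.
val nested = expr(
  "explode_outer(from_json(get_json_object(json, '$.nestedKey'), " +
  "'array<struct<`field name`:string, Value1:string, Value2:string>>'))"
).as("nested")

val allCols = (baseCols ++ topLevelCols) :+ nested

val result = df
  .select(allCols: _*)
  .select(
    col("id1"), col("other1"), col("k"), col("v"), col("p"), col("q"),
    // Backticks quote the struct field whose name contains a space.
    col("nested.`field name`").as("fieldname"),
    col("nested.Value1").as("valueone"),
    col("nested.Value2").as("valuetwo")
  )

result.show(truncate = false)
```

Adding another flat key then only requires extending `topLevelKeys`, while the nested fields keep their explicit aliases.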
    

    【Discussion】:
