csv-like input text to json string

[Question title]: csv-like input text to json string
[Posted]: 2017-06-26 15:50:57
[Question description]:

I have a csv-like input file that looks like this:

"2017-06-01T01:01:01Z";"{\"name\":\"aaa\",\"properties\":{"\"propA\":\"some value\",\"propB\":\"other value\"}}"
"2017-06-01T01:01:01Z";"{\"name\":\"bbb\",\"properties\":{"\"propB\":\"some value\","\"propC\":\"some value\",\"propD\":\"other value\"}}"

I want to get a json string like the following, so that I can create a dataframe from a plain json string:

[{
  "createdTime": "...",
  "value":{
    "name":"...",
    "properties": {
      "propA":"...",
      "propB":"..."
    }
  }
},{
  "createdTime": "...",
  "value":{
    "name":"...",
    "properties": {
      "propB":"...",
      "propC":"...",
      "propD":"..."
    }
  }
}]

This is semi-structured data: some rows may have property A, while others may not.

How can I do this in Spark using Scala?

[Question comments]:

Tags: apache-spark scala

[Solution 1]:

As I understand your question, you want to create a dataframe from your csv-like file. If that is right, here is what you can do:

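// Read the csv-like file; each line is "<timestamp>";"<escaped json>"
val data = sc.textFile("path to your csv-like file")
// Build one JSON object string per line, cleaning the stray and escaped quotes in the payload
val jsonrdd = data.map(line => line.split(";"))
  .map(array => "{\"createdTime\":" + array(0) + ",\"value\":" +
    array(1).replace(",\"", ",").replace("\\\"", "\"").replace("\"{", "{").replace("{\"\"", "{\"").replace("}\"", "}") + "}")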

val df = sqlContext.read.json(jsonrdd)
df.show(false)

You should get the following dataframe:

+--------------------+----------------------------------------------+
|createdTime         |value                                         |
+--------------------+----------------------------------------------+
|2017-06-01T01:01:01Z|[aaa,[some value,other value,null,null]]      |
|2017-06-01T01:01:01Z|[bbb,[null,some value,some value,other value]]|
+--------------------+----------------------------------------------+

The schema of the above dataframe will be:

root
 |-- createdTime: string (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- properties: struct (nullable = true)
 |    |    |-- propA: string (nullable = true)
 |    |    |-- propB: string (nullable = true)
 |    |    |-- propC: string (nullable = true)
 |    |    |-- propD: string (nullable = true)
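
The question also asks for a single json array string. The sketch below is one possible way to get it without Spark: it applies the same replace chain as the answer to each line and joins the results. The names `CsvLikeToJson`, `lineToJson`, and `toJsonArray` are hypothetical, and the code assumes every line has exactly two `;`-separated fields and that no property value itself contains sequences such as `}"` or `"{`, which the cleanup would corrupt.

```scala
object CsvLikeToJson {
  // Hypothetical helper: turns one csv-like line into one JSON object string.
  def lineToJson(line: String): String = {
    // Split on the first ';' only, so a ';' inside the payload is preserved.
    val Array(created, payload) = line.split(";", 2)
    val value = payload
      .replace(",\"", ",")      // drop a stray quote right after a comma
      .replace("\\\"", "\"")    // unescape \" to "
      .replace("\"{", "{")      // drop the quote that opens the payload
      .replace("{\"\"", "{\"")  // collapse {"" to {"
      .replace("}\"", "}")      // drop the quote that closes the payload
    // The timestamp field is already quoted in the input, so it is used as-is.
    s"""{"createdTime":$created,"value":$value}"""
  }

  // Hypothetical helper: joins the per-line objects into one JSON array string.
  def toJsonArray(lines: Seq[String]): String =
    lines.map(lineToJson).mkString("[", ",", "]")
}
```

For a small file, something like `CsvLikeToJson.toJsonArray(scala.io.Source.fromFile(path).getLines().toSeq)` would produce the array string, where `path` is a placeholder. For large inputs, the per-line approach in the answer is likely the better fit, since Spark reads one JSON object per record and infers the union of all properties as the schema.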

[Comments]: