【问题标题】:How to convert RDD of JSONs to Dataframe?如何将 JSON 的 RDD 转换为 Dataframe?
【发布时间】:2018-03-30 06:59:34
【问题描述】:

我有一个从一些 JSON 创建的 RDD,RDD 中的每条记录都包含键/值对。我的 RDD 看起来像:

myRdd.foreach(println)
{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1],
{"sequence":153,"id":8697389197662617,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637852762},1],
{"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637858607},1],
{"sequence":136,"id":8697374208897843,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637405129},1],
{"sequence":189,"id":8697413135394406,"trackingInfo":{"row":0,"trackId":14272744,"requestId":"284929d9-6147-4924-a19f-4a308730354c-3348447","rank":0,"videoId":80075830,"location":"PostPlay\/Next"},"type":["Play","Action","Session"],"time":527638558756},1],
{"sequence":130,"id":8697373887446384,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637394083}]

我会将每条记录转换为 spark 数据框中的一行,trackingInfo 中的嵌套字段应该有自己的列,type 列表也应该是自己的列。

到目前为止,我已经厌倦了使用案例类来拆分它:

case class Event(
    sequence: String, 
    id: String, 
    trackingInfo:String,
    location:String, 
    row:String, 
    trackId: String, 
    listrequestId: String, 
    videoId:String, 
    rank: String, 
    requestId: String, 
    `type`:String, 
    time: String)

val dataframeRdd = myRdd.map(line => line.split(",")).
    map(array => Event(
        array(0).split(":")(1),
        array(1).split(":")(1),
        array(2).split(":")(1),
        array(3).split(":")(1),
        array(4).split(":")(1),
        array(5).split(":")(1),
        array(6).split(":")(1),
        array(7).split(":")(1),
        array(8).split(":")(1),
        array(9).split(":")(1),
        array(10).split(":")(1),
        array(11).split(":")(1)
        ))

但是我不断收到java.lang.ArrayIndexOutOfBoundsException: 1 错误。

最好的方法是什么?如您所见,第 5 条记录在某些属性的排序上略有不同。是否可以根据属性名称进行解析,而不是根据“,”等进行解析。

我使用的是 Spark 1.6.x

【问题讨论】:

    标签: scala apache-spark apache-spark-sql scala-collections


    【解决方案1】:

    您的json rdd 似乎无效jsons。您需要将它们转换为有效的jsons as

    val validJsonRdd = myRdd.map(x => x.replace(",1],", ",").replace("}]", "}"))
    

    然后您可以使用sqlContext 将有效的rdd jsons 读入dataframe 作为

    val df = sqlContext.read.json(validJsonRdd)
    

    这应该给你数据框(我使用了你在问题中提供的无效 json)

    +----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
    |id              |sequence|time        |trackingInfo                                                                                                                             |type                   |
    +----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
    |8697344444103393|89      |527636408955|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
    |8697389197662617|153     |527637852762|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
    |8697389381205360|155     |527637858607|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
    |8697374208897843|136     |527637405129|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
    |8697413135394406|189     |527638558756|[null,PostPlay/Next,0,284929d9-6147-4924-a19f-4a308730354c-3348447,0,14272744,80075830]                                                  |[Play, Action, Session]|
    |8697373887446384|130     |527637394083|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
    +----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
    

    数据框的架构是

    root
     |-- id: long (nullable = true)
     |-- sequence: long (nullable = true)
     |-- time: long (nullable = true)
     |-- trackingInfo: struct (nullable = true)
     |    |-- listId: string (nullable = true)
     |    |-- location: string (nullable = true)
     |    |-- rank: long (nullable = true)
     |    |-- requestId: string (nullable = true)
     |    |-- row: long (nullable = true)
     |    |-- trackId: long (nullable = true)
     |    |-- videoId: long (nullable = true)
     |-- type: array (nullable = true)
     |    |-- element: string (containsNull = true)
    

    希望回答对你有帮助

    【讨论】:

      【解决方案2】:

      您可以使用 sqlContext.read.json(myRDD.map(_._2)) 将 json 读入数据帧

      【讨论】:

        猜你喜欢
        • 2017-10-30
        • 1970-01-01
        • 1970-01-01
        • 2020-01-24
        • 1970-01-01
        • 1970-01-01
        • 2019-09-09
        • 2016-07-17
        • 2015-12-08
        相关资源
        最近更新 更多