[Posted at]: 2018-08-09 05:20:24
[Problem description]:
I have a table with about 2 million rows, registered as a Spark DataFrame.
The table looks like this:
Columns:
CUSTADDRESSID, ADDRESSTYPE, ADDRESSLINE1, ADDRESSLINE2, ADDRESSLINE3, CITY, STATE, COUNTRY, ZIP1, ISACTIVE, ISCOMMUNICATION, CREATEDDATE, CREATEDUSER, UPDATEDDATE, UPDATEDUSER, REASONCODE, ZIP2, C_ACCNO, CUSTOMERID, ACCOUNTGROUPID, PREPAIDACCOUNTSTATUSID, PREPAIDACCOUNTSTATUSDATE, SOURCEOFENTRY, REVENUECATEGORYID, VEHICLENUMBER, VEHICLECLASS, SERIALNO, HEXTAGID, TAGSTATUS, TAGSTARTEFFDATE, TAGENDEFFDATE, ISTAGBLACKLISTED, ISBLACKLISTHOLD, RCVERIFICATIONSTATUS, EMAILADDRESS, PHONENUMBER, CCreatedDate, CCreatedUser, CUpdatedDate, CUpdatedUser, HISTID, ACTION, ISFEEWAIVER, FEEWAIVERPASSTYPE, VEHICLEIMGVERIFICATIONSTATUS, TAGTID, ISREVENUERECHARGE, RowNumber

Sample row (values in column order; some address and timestamp values contain spaces):
41 Mailing B309 PROGRESSIVE SIGNATURE SECTOR-6 GHANSOLI NAVI MUMBAI MH IND 400701 1 1 2013-06-07 12:55:54.827 bhagwadapos 2013-06-07 12:55:54.827 bhagwadapos NULL NULL 10003014 20000001 15 3079 2015-09-16 14:58:27.500 RegularRetailer 75 MH43AJ411 4 206158433290 91890704803000000C0A TAGINACTIVE 2014-08-08 14:24:12.227 2039-08-08 23:59:59.000 1 0 NULL shankarn75@rediffmail.com 9004419178 2013-06-07 12:56:16.650 bhagwadapos 2015-09-16 14:58:33.190 BatchProcess 15250 UPDATE NULL NULL NULL NULL NULL 1
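For context, this is roughly how the table ends up registered as a temp view before querying (a minimal sketch assuming Spark 2.x; the file path and format are placeholders, not the real source):

val df = sqlContext.read
  .option("header", "true")
  .csv("D:/joined_acc_add.csv") // placeholder path; the actual source may differ
df.createOrReplaceTempView("joined_acc_add") // the name queried below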
I want to convert it to JSON. The JSON file should look roughly like this (apologies, this is a hand-written sketch):
{
  "ACCOUNTNO": 10003018,
  "ADDRESS": [
    { ... }
  ],
  "VEHICLE": [
    { ... }
  ]
}
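Filled in with values from the sample row above (field lists abbreviated; which fields go into each object is my assumption, not final):

{
  "ACCOUNTNO": 10003014,
  "ADDRESS": [
    { "ADDRESSTYPE": "Mailing", "CITY": "NAVI MUMBAI", "STATE": "MH", "COUNTRY": "IND", "ZIP1": "400701", ... }
  ],
  "VEHICLE": [
    { "VEHICLENUMBER": "MH43AJ411", "VEHICLECLASS": 4, "TAGSTATUS": "TAGINACTIVE", ... }
  ]
}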
I have written a Spark SQL query, but I cannot create the two arrays VEHICLE and ADDRESS under ACCOUNTNO.
Here is the query:
val query2 = """
  SELECT
    C_ACCNO AS ACCOUNTNO,
    collect_set(struct(VEHICLENUMBER, CUSTOMERID, ACCOUNTGROUPID, PREPAIDACCOUNTSTATUSID,
                       PREPAIDACCOUNTSTATUSDATE, SOURCEOFENTRY, REVENUECATEGORYID, VEHICLECLASS,
                       SERIALNO, HEXTAGID, TAGSTATUS, TAGSTARTEFFDATE, TAGENDEFFDATE,
                       ISTAGBLACKLISTED, ISBLACKLISTHOLD, RCVERIFICATIONSTATUS, EMAILADDRESS,
                       PHONENUMBER, CREATEDDATE, CREATEDUSER, UPDATEDDATE, UPDATEDUSER,
                       ISFEEWAIVER, FEEWAIVERPASSTYPE, VEHICLEIMGVERIFICATIONSTATUS, TAGTID,
                       ISREVENUERECHARGE)) AS VEHICLE
  FROM joined_acc_add
  GROUP BY ACCOUNTNO
  ORDER BY ACCOUNTNO
"""
Then:
val res01 = sqlContext.sql(query2.toString)
res01.coalesce(1).write.json("D:/result01")
I need help finding the mistake in my query; as written, it throws an error.
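For reference, a minimal sketch of one way both arrays could be produced. It assumes the error comes from GROUP BY ACCOUNTNO (some Spark versions cannot resolve a SELECT alias in GROUP BY), so it groups by the underlying column C_ACCNO instead and adds a second collect_set for the address fields. The struct field lists are shortened here for readability:

val query3 = """
  SELECT
    C_ACCNO AS ACCOUNTNO,
    collect_set(struct(ADDRESSTYPE, ADDRESSLINE1, ADDRESSLINE2, ADDRESSLINE3,
                       CITY, STATE, COUNTRY, ZIP1)) AS ADDRESS,
    collect_set(struct(VEHICLENUMBER, VEHICLECLASS, SERIALNO, HEXTAGID,
                       TAGSTATUS, TAGSTARTEFFDATE, TAGENDEFFDATE)) AS VEHICLE
  FROM joined_acc_add
  GROUP BY C_ACCNO
  ORDER BY C_ACCNO
"""
val res02 = sqlContext.sql(query3)
res02.coalesce(1).write.json("D:/result02")

Note that collect_set deduplicates the structs within each account; collect_list would keep duplicates if that matters.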
[Comments]:
Tags: apache-spark apache-spark-sql