[Posted]: 2021-11-23 20:40:10
[Problem description]:
I have a JSON file with the following schema:
root
|-- count: long (nullable = true)
|-- results: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- address: string (nullable = true)
| | |-- auto_task_assignment: boolean (nullable = true)
| | |-- deleted_at: string (nullable = true)
| | |-- has_issues: boolean (nullable = true)
| | |-- has_timetable: boolean (nullable = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
| | |-- opening_hours: string (nullable = true)
| | |-- phone_number: string (nullable = true)
| | |-- position_id: long (nullable = true)
| | |-- show_technical_time: boolean (nullable = true)
| | |-- structure_id: long (nullable = true)
| | |-- subcontract_number: string (nullable = true)
| | |-- task_modification: boolean (nullable = true)
| | |-- updated_at: string (nullable = true)
I want to parse the results array into a DataFrame containing all the columns listed in the schema.
When I try to do it with a select statement, I get an error:
df.select("results.*").show()
Error message:
AnalysisException: Can only star expand struct data types. Attribute: `ArrayBuffer(results)`
Can you help me flatten this JSON?
Sample data:
{'count': 11, 'next': None, 'previous': None, 'results': [{'id': 1, 'name': 'Samodzielny Publiczny Szpital Kliniczny Nr 1 PUM', 'external_id': None, 'structure_id': 1, 'address': '71-252 Szczecin, Ul. Unii Lubelskiej 1 ', 'phone_number': '+48123456789', 'opening_hours': 'pn-pt: 9:00-17:00', 'deleted_at': '2021-05-27T13:02:12.026410+02:00', 'updated_at': '2021-05-27T13:02:12.026417+02:00', 'position_id': None, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': False, 'has_issues': False, 'show_technical_time': False, 'subcontract_number': None}, {'id': 2, 'name': 'Szpital polowy we wrocławiu', 'external_id': None, 'structure_id': 2, 'address': 'North Montytown, 0861 Greenholt Crescent', 'phone_number': '+48505505505', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-18T16:15:06.608476+01:00', 'position_id': 49, 'has_timetable': True, 'auto_task_assignment': False, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919....191919919; 191919191991; 19991919919; 1919919 191919919; 191919191991; 19991919919; 1919919191919919; 191919191991; 19991919919; 1919919 191919919; 1919191-255c'}, {'id': 3, 'name': 'Test', 'external_id': None, 'structure_id': 17, 'address': 'ul. Śliczna', 'phone_number': '+48500100107', 'opening_hours': '', 'deleted_at': None, 'updated_at': '2021-11-04T14:22:04.712607+01:00', 'position_id': 33, 'has_timetable': True, 'auto_task_assignment': True, 'task_modification': True, 'has_issues': True, 'show_technical_time': True, 'subcontract_number': '07001234'}]}
I found a workaround using a pandas DataFrame, but my goal is to do it in Spark:
enum = 0
for i in df['results']:
    if enum == 0:
        df2 = pd.DataFrame(i, index=[0])
        enum += 1  # original had `enum=+1`, which assigns +1 instead of incrementing
    else:
        df2 = df2.append(i, ignore_index=True)
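As a side note, the loop above can be replaced with a single pandas.json_normalize call. A minimal sketch, assuming the raw payload is available as a Python dict (here a trimmed stand-in named payload): record_path flattens each element of results into its own row, and meta repeats the top-level count value on every row.

```python
import pandas as pd

# Hypothetical payload: same shape as the sample data, trimmed for brevity.
payload = {
    "count": 2,
    "results": [
        {"id": 1, "name": "Samodzielny Publiczny Szpital Kliniczny Nr 1 PUM"},
        {"id": 2, "name": "Szpital polowy we wrocławiu"},
    ],
}

# One row per element of `results`; `meta` repeats top-level `count` on each row.
df2 = pd.json_normalize(payload, record_path="results", meta=["count"])
print(df2.columns.tolist())  # → ['id', 'name', 'count']
```

This also avoids DataFrame.append, which is deprecated in recent pandas versions.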
The expected output keeps the count column, repeating the same value on every row, and extracts all columns from the results struct. The expected schema is:
root
|-- count: long (nullable = true)
|-- address: string (nullable = true)
|-- auto_task_assignment: boolean (nullable = true)
|-- deleted_at: string (nullable = true)
|-- has_issues: boolean (nullable = true)
|-- has_timetable: boolean (nullable = true)
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- opening_hours: string (nullable = true)
|-- phone_number: string (nullable = true)
|-- position_id: long (nullable = true)
|-- show_technical_time: boolean (nullable = true)
|-- structure_id: long (nullable = true)
|-- subcontract_number: string (nullable = true)
|-- task_modification: boolean (nullable = true)
|-- updated_at: string (nullable = true)
[Discussion]: