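【Posted】: 2020-05-07 18:22:11
【Question】:
I am going to import data from JSON files into Google BigQuery, and I would like to know which is the better practice: nesting the fields and adding a field with schema mode "REPEATED" to avoid duplication, or keeping the duplicated information with less nesting. A further reason for keeping the duplicated information is that, as far as I know, BigQuery works best with denormalized data. However, I am not sure whether this denormalization should be done before or after the data is imported.
For example, suppose my data is:
Nested version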
{
  "store": "Pete's Market",
  "city": "NYC",
  "product": [
    {
      "id": "2468",
      "item": "apple",
      "price": "$1"
    },
    {
      "id": "1357",
      "item": "cereal",
      "price": "$3",
      "brand": "Cheerios"
    }
  ]
}
# The actual JSON data file will have this in one row:
# {"store":"Pete's Market","city":"NYC","product":[{"id":"2468","item":"apple","price":"$1"},{"id":"1357","item":"cereal","price":"$3","brand":"Cheerios"}]}
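For reference, the nested layout above corresponds to a schema in which `product` is a RECORD column with mode REPEATED. The following is a minimal sketch in Python that builds such a schema in the JSON form used for BigQuery schema files. The field types (all STRING, since every sample value is a quoted string) and the NULLABLE mode on `brand` are assumptions drawn from the sample data, not something stated in the question.

```python
import json

# Assumed schema for the nested layout. Every leaf is treated as STRING
# because the sample values (including "id" and "price") are quoted strings.
nested_schema = [
    {"name": "store", "type": "STRING", "mode": "NULLABLE"},
    {"name": "city", "type": "STRING", "mode": "NULLABLE"},
    {
        "name": "product",
        "type": "RECORD",
        "mode": "REPEATED",  # one store row holds many products
        "fields": [
            {"name": "id", "type": "STRING", "mode": "NULLABLE"},
            {"name": "item", "type": "STRING", "mode": "NULLABLE"},
            {"name": "price", "type": "STRING", "mode": "NULLABLE"},
            # "brand" is absent on some products, so it has to be nullable.
            {"name": "brand", "type": "STRING", "mode": "NULLABLE"},
        ],
    },
]

schema_json = json.dumps(nested_schema, indent=2)
print(schema_json)
```

In the repeated-information layout the same `product` record would presumably be declared with mode NULLABLE instead of REPEATED, since each row then carries exactly one product.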
Repeated-information version
{
  "store": "Pete's Market",
  "city": "NYC",
  "product": {
    "id": "2468",
    "item": "apple",
    "price": "$1"
  }
}
{
  "store": "Pete's Market",
  "city": "NYC",
  "product": {
    "id": "1357",
    "item": "cereal",
    "price": "$3",
    "brand": "Cheerios"
  }
}
# The actual JSON data file will have this in two rows:
# {"store":"Pete's Market","city":"NYC","product":{"id":"2468","item":"apple","price":"$1"}}
# {"store":"Pete's Market","city":"NYC","product":{"id":"1357","item":"cereal","price":"$3","brand":"Cheerios"}}
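If the denormalization were done before import, turning the nested record into the repeated-information rows could be sketched roughly as follows. This is only an illustration in Python: `flatten_store` is a hypothetical helper name, and the sketch assumes the repeated field is always called `product` and is always a list.

```python
import json


def flatten_store(record):
    """Yield one row per product, copying the store-level fields into each row.

    Hypothetical helper: assumes the repeated field is named "product".
    """
    parent = {k: v for k, v in record.items() if k != "product"}
    for product in record.get("product", []):
        yield {**parent, "product": product}


nested = {
    "store": "Pete's Market",
    "city": "NYC",
    "product": [
        {"id": "2468", "item": "apple", "price": "$1"},
        {"id": "1357", "item": "cereal", "price": "$3", "brand": "Cheerios"},
    ],
}

rows = list(flatten_store(nested))

# Newline-delimited JSON: one compact object per line, as in the data file.
ndjson = "\n".join(json.dumps(row, separators=(",", ":")) for row in rows)
print(ndjson)
```

Each output line repeats `store` and `city`, which is exactly the duplication the question is weighing against the nested layout.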
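Additional notes
In the nested version, the field with mode "REPEATED" may contain thousands of products. Some products may have fields that others do not. The fields inside each product may be nested a further 2-3 levels deep.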
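Because some products carry fields that others lack, it can help to find out which product fields are optional before settling on a schema. A rough Python sketch, assuming the records are already parsed into dictionaries; `product_field_union` is a hypothetical helper, and it only inspects the top level of each product, not the deeper 2-3 levels of nesting mentioned above.

```python
def product_field_union(records):
    """Count how many products carry each top-level product field.

    Hypothetical helper: assumes each record has a list under "product".
    """
    counts = {}
    for record in records:
        for product in record.get("product", []):
            for key in product:
                counts[key] = counts.get(key, 0) + 1
    return counts


records = [
    {
        "store": "Pete's Market",
        "city": "NYC",
        "product": [
            {"id": "2468", "item": "apple", "price": "$1"},
            {"id": "1357", "item": "cereal", "price": "$3", "brand": "Cheerios"},
        ],
    }
]

counts = product_field_union(records)
total_products = sum(len(r.get("product", [])) for r in records)

# Fields present on fewer than all products would need to be nullable.
optional_fields = sorted(k for k, n in counts.items() if n < total_products)
print(optional_fields)
```

For the sample data this reports `brand` as the only optional field, which applies to either layout since both keep the same product record.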
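Question restated
Is it better practice to nest the fields and add a field with schema mode "REPEATED" to avoid duplication, or to leave the duplicated information in place?
【Discussion】:
Tags: json google-bigquery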