[Question title]: Should my JSON data contain a nested field versus duplicate information?
[Posted]: 2020-05-07 18:22:11
[Question description]:

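I will be importing data from JSON files into Google BigQuery, and I would like to know which is the better practice: nesting fields and adding a field whose schema mode is "REPEATED" to avoid duplication, or keeping the duplicated information with less nesting. One argument for keeping the duplication is that, as far as I know, BigQuery works best with denormalized data. However, I am not sure whether that denormalization should happen before or after the data is imported.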

For example, suppose my data is:

Nested version

{
   "store": "Pete's Market",
   "city": "NYC",
   "product": [
      {
          "id": "2468",
          "item": "apple",
          "price": "$1"
      },
      {
          "id": "1357",
          "item": "cereal",
          "price": "$3",
          "brand": "Cheerios"
      }
   ]
}

# The actual JSON data file will have this in one row:
# {"store":"Pete's Market","city":"NYC","product":[{"id":"2468","item":"apple","price":"$1"},{"id":"1357","item":"cereal","price":"$3","brand":"Cheerios"}]}

Duplicated-information version

{
   "store": "Pete's Market",
   "city": "NYC",
   "product":
      {
          "id": "2468",
          "item": "apple",
          "price": "$1"
      }
}
{
   "store": "Pete's Market",
   "city": "NYC",
   "product":
      {
          "id": "1357",
          "item": "cereal",
          "price": "$3",
          "brand": "Cheerios"
      }
}

# The actual JSON data file will have this in two rows:
# {"store":"Pete's Market","city":"NYC","product":{"id": "2468","item":"apple","price":"$1"}}
# {"store":"Pete's Market","city":"NYC","product":{"id":"1357","item":"cereal","price":"$3","brand":"Cheerios"}}
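
If the denormalization were done before import, the two layouts above could be converted with a few lines of Python. This is only a minimal sketch using the standard library: the helper name `flatten_store` is hypothetical, and it assumes each input record holds its products in a single array field, as in the nested example.

```python
import json


def flatten_store(record, array_field="product"):
    """Yield one row per element of record[array_field], copying the parent fields.

    Hypothetical helper for illustration; not part of any BigQuery tooling.
    """
    parent = {k: v for k, v in record.items() if k != array_field}
    for item in record.get(array_field, []):
        yield {**parent, array_field: item}


# The nested record from the question.
nested = {
    "store": "Pete's Market",
    "city": "NYC",
    "product": [
        {"id": "2468", "item": "apple", "price": "$1"},
        {"id": "1357", "item": "cereal", "price": "$3", "brand": "Cheerios"},
    ],
}

# One JSON object per line (newline-delimited JSON), one line per product.
rows = list(flatten_store(nested))
ndjson = "\n".join(json.dumps(row, separators=(",", ":")) for row in rows)
print(ndjson)
```

Each output line repeats `store` and `city`, which is exactly the duplication the second layout accepts in exchange for less nesting.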

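Additional notes

There could be thousands of products in the field that has mode REPEATED in the nested version. Some products may have fields that others do not. The fields within each product may be nested another 2-3 levels deep.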
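
For reference, a schema for the nested layout might look roughly like the sketch below, written as a Python structure in the shape of a BigQuery JSON schema file (`name`, `type`, `mode`, `fields`). The field types are assumptions based on the sample data (every value there is a string, including `price`), and `repeated_fields` is a hypothetical helper that only lists which fields are REPEATED.

```python
import json

# Assumed schema for the nested version; types are inferred from the sample data.
schema = [
    {"name": "store", "type": "STRING", "mode": "NULLABLE"},
    {"name": "city", "type": "STRING", "mode": "NULLABLE"},
    {
        "name": "product",
        "type": "RECORD",
        "mode": "REPEATED",  # an array of product records
        "fields": [
            {"name": "id", "type": "STRING", "mode": "NULLABLE"},
            {"name": "item", "type": "STRING", "mode": "NULLABLE"},
            {"name": "price", "type": "STRING", "mode": "NULLABLE"},
            # Only some products carry a brand, so the field must be optional.
            {"name": "brand", "type": "STRING", "mode": "NULLABLE"},
        ],
    },
]


def repeated_fields(fields, prefix=""):
    """Return the dotted path of every REPEATED field, at any nesting depth."""
    paths = []
    for field in fields:
        path = prefix + field["name"]
        if field.get("mode") == "REPEATED":
            paths.append(path)
        paths.extend(repeated_fields(field.get("fields", []), path + "."))
    return paths


print(json.dumps(schema, indent=2))
print(repeated_fields(schema))
```

For the duplicated-information version, the same schema would apply with `product` declared as a NULLABLE (or REQUIRED) RECORD instead of a REPEATED one.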

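Question restated

Is the better practice to nest fields and add a field with schema mode "REPEATED" to avoid duplication, or to leave the duplicated information in place?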

[Question discussion]:

    Tags: json google-bigquery


    [Solution 1]:

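    If your data is write-once or append-only, either form works fine, with only a slight usability difference (noted below); the normalized (nested) form has slightly lower storage and scan costs.

    The slight usability difference: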

    -- With nested schema
    SELECT *  -- here all product's sub fields are entering parent namespace and no way to rename them in batch
    FROM table, UNNEST(product) prd
    

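    But if you later want to update only certain products, that is hard to do with nested products, because there is no way to update a single item in a repeated field: you have to load the whole array, scan and modify it, and then repack it to update the column. You would write something like: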

    UPDATE table
    SET product = (SELECT ARRAY_AGG(...) FROM UNNEST(Product) WHERE ...)
    WHERE ... -- filter to specific products
    

    In this case, the denormalized form is easier to work with.
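
    The difference can be illustrated outside BigQuery with a plain-Python analogy. This is only a sketch of the idea, not BigQuery code: `update_nested` and `update_flat` are hypothetical helpers, and the sample rows are the ones from the question.

```python
def update_nested(row, product_id, changes):
    """Nested layout: rebuild the whole product array to change one element."""
    new_products = []
    for product in row["product"]:
        if product["id"] == product_id:
            product = {**product, **changes}
        new_products.append(product)  # untouched elements are re-emitted too
    return {**row, "product": new_products}


def update_flat(rows, product_id, changes):
    """Flat layout: a row-level filter selects the product to change."""
    updated = []
    for row in rows:
        if row["product"]["id"] == product_id:
            row = {**row, "product": {**row["product"], **changes}}
        updated.append(row)
    return updated


nested_row = {
    "store": "Pete's Market",
    "city": "NYC",
    "product": [
        {"id": "2468", "item": "apple", "price": "$1"},
        {"id": "1357", "item": "cereal", "price": "$3", "brand": "Cheerios"},
    ],
}
flat_rows = [
    {"store": "Pete's Market", "city": "NYC",
     "product": {"id": "2468", "item": "apple", "price": "$1"}},
    {"store": "Pete's Market", "city": "NYC",
     "product": {"id": "1357", "item": "cereal", "price": "$3", "brand": "Cheerios"}},
]

updated_nested = update_nested(nested_row, "1357", {"price": "$4"})
updated_flat = update_flat(flat_rows, "1357", {"price": "$4"})
print(updated_nested["product"][1]["price"])
print(updated_flat[1]["product"]["price"])
```

    In the nested case the array is the unit of change, so every product in it is read and written back; in the flat case the product is its own row and can be targeted directly.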

    [Discussion]:
