[Posted]: 2021-12-07 07:09:57
[Question]:
I have a list of dictionaries with more than 100,000 entries.
How can I convert it to JSON and write it out as a newline-delimited JSON file, the way BigQuery requires, i.e.:
{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
{"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}
instead of
[{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}, {"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}]
Note the difference between the two: the first is newline-delimited, while the second is comma-separated (a plain JSON dump in Python). I need the first.
What I did before was, in the last part of the loop, something like this:
while condition:
    with open('cache/name.json', 'a') as a:
        json_data = json.dumps(store)
        a.write(json_data + '\n')
Doing it this way, the file is opened and written once per entry in the list of dictionaries, which makes the loop slow.
How can I write it in the format BigQuery requires, but faster?
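A minimal sketch of one common fix: open the file once, then write each dictionary as one `json.dumps` line inside the loop. The file name `name.json` and the sample records below are illustrative placeholders, not taken from the question's actual data:

    import json

    def write_ndjson(records, path):
        # Open the file once; each dict becomes one line of
        # newline-delimited JSON, the format BigQuery loads.
        with open(path, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

    records = [
        {"id": "1", "first_name": "John"},
        {"id": "2", "first_name": "Jane"},
    ]
    write_ndjson(records, 'name.json')

The speedup comes from moving `open()` out of the loop: appending inside the loop reopens (and seeks to the end of) the file once per record, while a single file handle lets Python's buffered I/O batch the writes.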
[Comments]:
Hi @Arci M, if you found my answer helpful, please consider accepting/upvoting it, per the Stack Overflow guidelines.
Tags: python json python-2.7 google-bigquery