【Question Title】: ETL data from BigQuery to Redshift using Python
【Posted】: 2016-10-10 12:12:04
【Question Description】:

I have this script in Python that sets a variable with the result of a query run against Google BigQuery (some of the imported libraries are unused here; I am testing converting the JSON output to a CSV file):

import httplib2
import datetime
import json
import csv
import sys
from oauth2client.service_account import ServiceAccountCredentials
from bigquery import get_client


#Set DAY - 1
yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
today = datetime.datetime.now()

#Format to Date
yesterday = '{:%Y-%m-%d}'.format(yesterday)
today = '{:%Y-%m-%d}'.format(today)


# BigQuery project id as listed in the Google Developers Console.
project_id = 'project'

# Service account email address as listed in the Google Developers Console.
service_account = 'email@email.com'


scope = 'https://www.googleapis.com/auth/bigquery'

credentials = ServiceAccountCredentials.from_json_keyfile_name('/path/to/file/.json', scope)

http = httplib2.Http()
http = credentials.authorize(http)


client = get_client(project_id, credentials=credentials, service_account=service_account)

# Synchronous query
try:
    _job_id, results = client.query("SELECT * FROM dataset.table WHERE CreatedAt >= PARSE_UTC_USEC('" + yesterday + "') and CreatedAt < PARSE_UTC_USEC('" + today + "') limit 1", timeout=1000)
except Exception as e:
    print(e)

print(results)

The `results` variable returns something like this:

[
{u'Field1': u'Msn', u'Field2': u'00000000000000', u'Field3': u'jsdksf422552d32', u'Field4': u'00000000000000', u'Field5': 1476004363.421,
u'Field6': u'message', u'Field7': u'msn',
u'Field8': None,
u'Field9': u'{"user":{"field":"j23h4sdfsf345","field":"Msn","field":"000000000000000000","field":true,"field":"000000000000000000000","field":"2016-10-09T09:12:43.421Z"}}', u'Field10': 1476004387.016}
]

I need to load this into Amazon Redshift, but in this format I can't run a COPY from S3 with the .json it produces...

Is there a way to modify this JSON so Redshift can load it? Or to return a .csv directly? I don't know much about this BigQuery library or about Python (this is one of my first scripts).
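As an aside (an approach not in the original post, using placeholder field names that mirror the anonymized output above): Redshift's COPY can ingest newline-delimited JSON with the `JSON 'auto'` option, so one option is to write the list of dicts as one JSON object per line before uploading to S3. A minimal sketch:

```python
import json

# Sample rows shaped like the BigQuery result above (field names are
# placeholders, matching the anonymized output in the question).
results = [
    {u'Field1': u'Msn', u'Field5': 1476004363.421, u'Field7': None},
    {u'Field1': u'Other', u'Field5': 1476004387.016, u'Field7': u'x'},
]

# Redshift's COPY ... JSON 'auto' expects one JSON object per line
# (newline-delimited JSON), not a single JSON array.
with open("rows.json", "w") as out:
    for row in results:
        out.write(json.dumps(row) + "\n")
```

The resulting file can then be uploaded to S3 and loaded with something like `COPY mytable FROM 's3://bucket/rows.json' ... JSON 'auto';` (Redshift matches JSON keys to column names, case-insensitively).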

Thanks a lot!

【Question Discussion】:

Tags: python json csv etl


【Solution 1】:

To get rid of the "u" prefix before the fields (Python 2's unicode string marker), serialize the results to JSON:

results = json.dumps(results)
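The `u` prefix is only Python 2's repr for unicode strings, not part of the data; `json.dumps` serializes the objects to a plain JSON string, and `json.loads` parses it back, so the round-trip simply normalizes everything to standard JSON types. A small illustration with made-up values:

```python
import json

row = {u'Field1': u'Msn', u'Field5': 1476004363.421, u'Field7': None}

encoded = json.dumps(row)      # a plain JSON string, no u prefixes
decoded = json.loads(encoded)  # back to a dict with JSON-native types

# The round-trip preserves the data; Python None maps to JSON null and back.
assert decoded == {'Field1': 'Msn', 'Field5': 1476004363.421, 'Field7': None}
```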

Then, to convert the JSON variable into a CSV file, I wrote:

# Transform the json variable to csv:
# round-trip through JSON to normalize everything to plain types
results = json.loads(json.dumps(results))

# Write one pipe-delimited row per result record
with open("file.csv", "w") as out:
    f = csv.writer(out, delimiter='|')
    f.writerow(["field", "field", "field", "field", "field",
                "field", "field", "field", "field", "field"])
    for row in results:
        f.writerow([row["field"],
                    row["field"],
                    row["field"],
                    row["field"],
                    row["field"],
                    row["field"],
                    row["field"],
                    row["field"],
                    row["field"],
                    row["field"]])

After this, I was able to load the file into Redshift.
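A more general variant (an assumption on my part, not in the original answer): `csv.DictWriter` can derive the columns from the first row's keys, which avoids hardcoding the ten placeholder names. A sketch with hypothetical field names:

```python
import csv

# Rows shaped like the query result; the field names here are hypothetical.
results = [
    {u'Field1': u'Msn', u'Field2': u'msg', u'Field3': None},
    {u'Field1': u'Other', u'Field2': u'msg2', u'Field3': u'x'},
]

# Column order follows the first row's key order.
fieldnames = list(results[0].keys())

# newline='' prevents blank lines on Windows; '|' matches the answer's
# delimiter, so the matching COPY option would be DELIMITER '|'.
with open("file.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter='|')
    writer.writeheader()
    writer.writerows(results)
```

Note that the csv module writes `None` as an empty field; on the Redshift side, `COPY ... DELIMITER '|' IGNOREHEADER 1 EMPTYASNULL` would skip the header row and turn those empties back into NULLs.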

【Discussion】:
