【Question Title】: How to run a BigQuery query in Python
【Posted】: 2017-12-13 16:52:36
【Question】:

This is the query I have been running in BigQuery and that I now want to run from my Python script. What would I have to change, or add, to get it to run in Python?

#standardSQL
SELECT
  Serial,
  MAX(createdAt) AS Latest_Use,
  SUM(ConnectionTime/3600) as Total_Hours,
  COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH"
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;

From my research it looks like I cannot save this query as a permanent table using Python. Is that true? And if it is, is it still possible to export a temporary table?

【Comments】:

    Tags: python google-bigquery


    【Solution 1】:

    You need to use the BigQuery Python client lib; with it, something like the following should get you up and running:

    from google.cloud import bigquery
    client = bigquery.Client(project='PROJECT_ID')
    query = "SELECT...."
    dataset = client.dataset('dataset')
    table = dataset.table(name='table')
    job = client.run_async_query('my-job', query)
    job.destination = table
    job.write_disposition = 'WRITE_TRUNCATE'
    job.begin()
    

    https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-usage.html

    Also check out the current BigQuery Python client tutorial for the up-to-date API.

    【Discussion】:

    • Since this is the accepted answer, I will add this here: you need to set use_legacy_sql to False in the job_config to run the OP's query, because it defaults to True. For example: `job_config = bigquery.QueryJobConfig(); job_config.use_legacy_sql = False; client.query(query, job_config=job_config)`
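Relatedly, the `#standardSQL` line at the top of the OP's query is a dialect prefix written inside the query text itself. A tiny helper like the one below (purely illustrative; `ensure_standard_sql` is not part of any library) shows how such a prefix can be added defensively to query strings that might be missing it:

```python
def ensure_standard_sql(sql):
    """Prepend the #standardSQL dialect prefix unless one is already present."""
    stripped = sql.lstrip()
    if stripped.startswith("#standardSQL") or stripped.startswith("#legacySQL"):
        return sql
    return "#standardSQL\n" + sql


print(ensure_standard_sql("SELECT 1").splitlines()[0])  # → #standardSQL
```

Setting use_legacy_sql=False in the job config, as the comment above says, is the explicit route in the client library; the query-text prefix is simply what the OP's query already relies on.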
    【Solution 2】:

    Here is a good guide to its usage: https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/index.html

    To simply run a query and write its results to a table:

    # from google.cloud import bigquery
    # client = bigquery.Client()
    # dataset_id = 'your_dataset_id'
    
    job_config = bigquery.QueryJobConfig()
    # Set the destination table
    table_ref = client.dataset(dataset_id).table("your_table_id")
    job_config.destination = table_ref
    sql = """
        SELECT corpus
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus;
    """
    
    # Start the query, passing in the extra configuration.
    query_job = client.query(
        sql,
        # Location must match that of the dataset(s) referenced in the query
        # and of the destination table.
        location="US",
        job_config=job_config,
    )  # API request - starts the query
    
    query_job.result()  # Waits for the query to finish
    print("Query results loaded to table {}".format(table_ref.path))
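
The destination table written by the snippet above is a permanent table, which also covers the OP's concern about saving results from Python. If you then want the data out of BigQuery entirely, the client can export a finished table to Cloud Storage. A sketch under the assumption that you have google-cloud-bigquery installed, working credentials, and an existing GCS bucket (all names below are placeholders):

```python
def gcs_destination(bucket_name, filename):
    """Build the gs:// URI that an extract job writes to."""
    return "gs://{}/{}".format(bucket_name, filename)


def export_table_to_gcs(project, dataset_id, table_id, bucket_name):
    """Export a permanent BigQuery table to a CSV file in Cloud Storage."""
    from google.cloud import bigquery  # deferred: only needed when actually exporting

    client = bigquery.Client(project=project)
    table_ref = client.dataset(dataset_id).table(table_id)
    destination_uri = gcs_destination(bucket_name, "{}.csv".format(table_id))
    extract_job = client.extract_table(table_ref, destination_uri, location="US")
    extract_job.result()  # block until the export job finishes
    return destination_uri


# export_table_to_gcs('PROJECT_ID', 'your_dataset_id', 'your_table_id', 'your-bucket')
```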
    

    【Discussion】:

      【Solution 3】:

      Personally, I prefer to query with pandas:

      # BQ authentication
      import pandas as pd
      import pydata_google_auth

      SCOPES = [
          'https://www.googleapis.com/auth/cloud-platform',
          'https://www.googleapis.com/auth/drive',
      ]

      credentials = pydata_google_auth.get_user_credentials(
          SCOPES,
          # Set auth_local_webserver to True to have a slightly more convenient
          # authorization flow. Note, this doesn't work if you're running from a
          # notebook on a remote server, such as over SSH or with Google Colab.
          auth_local_webserver=True,
      )

      MY_PROJECT_ID = 'your-project-id'
      query = "SELECT * FROM my_table"

      data = pd.read_gbq(query, project_id=MY_PROJECT_ID, credentials=credentials, dialect='standard')
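
The pandas route can also cover the permanent-table part of the question: pandas-gbq's `to_gbq` writes a DataFrame back to a BigQuery table. A sketch, assuming the same credentials as above; the wrapper function and all table names are placeholders of my own, not part of any library:

```python
def save_frame_to_bigquery(df, destination_table, project_id, credentials=None):
    """Write a DataFrame to a permanent BigQuery table via pandas-gbq.

    destination_table must be in 'dataset.table' form.
    """
    if "." not in destination_table:
        raise ValueError("destination_table must look like 'dataset.table'")
    import pandas_gbq  # deferred: requires the pandas-gbq package

    pandas_gbq.to_gbq(
        df,
        destination_table,
        project_id=project_id,
        credentials=credentials,
        if_exists="replace",  # overwrite the table if it already exists
    )


# save_frame_to_bigquery(data, 'FirebaseArchive.query_results', MY_PROJECT_ID, credentials)
```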
      

      【Discussion】:

        【Solution 4】:

        The pythonbq package is very simple to use and a great place to start. It uses python-gbq under the hood.

        To get started, you need to generate a BQ JSON key for external app access. You can generate your key here.

        Your code will look something like this:

        from pythonbq import pythonbq
        
        myProject=pythonbq(
          bq_key_path='path/to/bq/key.json',
          project_id='myGoogleProjectID'
        )
        SQL_CODE="""
        SELECT
          Serial,
          MAX(createdAt) AS Latest_Use,
          SUM(ConnectionTime/3600) as Total_Hours,
          COUNT(DISTINCT DeviceID) AS Devices_Connected
        FROM `dataworks-356fa.FirebaseArchive.testf`
        WHERE Model = "BlueBox-pH"
        GROUP BY Serial
        ORDER BY Serial
        LIMIT 1000;
        """
        output=myProject.query(sql=SQL_CODE)
        
        

        【Discussion】:

          【Solution 5】:

          Here is another way, using a JSON key file for the service account:

          >>> from google.cloud import bigquery
          >>>
          >>> CREDS = 'test_service_account.json'
          >>> client = bigquery.Client.from_service_account_json(json_credentials_path=CREDS)
          >>> job = client.query('select * from dataset1.mytable')
          >>> for row in job.result():
          ...     print(row)
          

          【Discussion】:
