Google BigQuery Batch Query - job state is not updating after the job finished

【Question Title】: Google BigQuery Batch Query - job state is not updating after the job finished
【Posted】: 2022-10-25 20:37:16
【Question】:

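I am running a Google BigQuery batch query from a Python script in a Jupyter notebook. Usually the query takes about an hour when run in interactive mode. This morning I checked, and the script still showed the job in state RUNNING, 16 hours later. So I checked INFORMATION_SCHEMA.JOBS, which says the job is already in state DONE, that there was no error during execution, and that the query took about an hour (I have another query still "running" in Python whose status, when looked up in INFORMATION_SCHEMA.JOBS, shows an error).

So I interrupted the kernel and checked: the dataframe I store the results in is populated, so I already have the results, but the state still shows RUNNING.

After I explicitly requested the job again: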

query_job_test = client.get_job(
    'my_job_id', location='my_location'
)

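I got the correct state DONE.

What am I doing wrong? How can I keep my script from getting stuck even though the job has already finished?

See the code snippets below:

Querying INFORMATION_SCHEMA.JOBS: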

SELECT
  *
FROM
  `my_project_id`.`region-my_region`.INFORMATION_SCHEMA.JOBS
WHERE
  job_id = 'my_job_id'

Python script that runs the batch query:

import time

from google.cloud import bigquery
from google.oauth2 import service_account

key_path = "../path_to_my_credentials.json"

credentials = service_account.Credentials.from_service_account_file(
    key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

client = bigquery.Client(credentials=credentials, project=credentials.project_id,)

job_config = bigquery.QueryJobConfig(
    priority=bigquery.QueryPriority.BATCH
)

query = """ SELECT * from my_table """

def set_df(future):
    global df
    df= future.to_dataframe()

query_job = client.query(query, job_config=job_config)
query_job.add_done_callback(set_df)

query_job = client.get_job(
    query_job.job_id, location=query_job.location
) 


while(query_job.state != 'DONE'):
    time.sleep(60)

print(df.head())

Update: as a workaround, I changed my script to:

def set_df(future):
    global df_all
    global is_done
    is_done = True
    df_all = future.to_dataframe()

while(not 'is_done' in globals()):
    time.sleep(60)

del is_done
print(df_all.head())

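However, I run into the same job state problem with all of my longer queries.

【Question Comments】:

Tags: python google-bigquery jobs information-schema

【Solution 1】:

You are not updating the job inside the while loop. Add a client.get_job call inside the loop to fetch the updated state and it should work: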

while(query_job.state != 'DONE'):
    time.sleep(60)
    query_job = client.get_job(
        query_job.job_id, location=query_job.location
    )
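
The object returned by client.get_job appears to be a local snapshot of the job, so its state does not change until it is fetched again. The loop above can be wrapped in a small helper that also gives up after a while. This is only a sketch: wait_for_job, poll_seconds, timeout_seconds and the injectable sleep parameter are hypothetical names introduced here, not part of google-cloud-bigquery; only client.get_job, job_id, location and state come from the question.

```python
import time


def wait_for_job(client, job, poll_seconds=60, timeout_seconds=None, sleep=time.sleep):
    """Poll until the job reports DONE, re-fetching its state on every pass.

    `client` is expected to behave like google.cloud.bigquery.Client and
    `job` like the object returned by client.query(). This helper is a
    hypothetical illustration, not a library function.
    """
    waited = 0
    while job.state != "DONE":
        if timeout_seconds is not None and waited >= timeout_seconds:
            raise TimeoutError(
                f"job {job.job_id} still {job.state} after {waited}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
        # The job object is a local snapshot: fetch a fresh copy from the API.
        job = client.get_job(job.job_id, location=job.location)
    return job
```

With the names from the question, it would be called as query_job = wait_for_job(client, query_job, timeout_seconds=4 * 3600). The timeout guards against the 16-hour hang described in the question; batch-priority jobs can sit in the queue for a while, so the value should be generous.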

【Comments】:

Thank you so much! That could definitely be the cause.
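
If the script only needs to block until the query finishes, one possible alternative to a hand-written loop is to let the client library do the waiting: QueryJob.result() blocks until the job completes and raises if the job failed. The sketch below assumes that behaviour; run_query_to_dataframe is a hypothetical wrapper name, and to_dataframe() needs pandas installed.

```python
def run_query_to_dataframe(client, sql, job_config=None, timeout=None):
    """Submit a query, block until it finishes, and return a DataFrame.

    `client` is expected to behave like google.cloud.bigquery.Client.
    This wrapper is a hypothetical illustration, not a library function.
    """
    job = client.query(sql, job_config=job_config)
    # result() waits for completion (the library polls internally) and
    # raises an exception if the job ended with an error or timed out.
    rows = job.result(timeout=timeout)
    return rows.to_dataframe()
```

With the names from the question: df = run_query_to_dataframe(client, query, job_config=job_config). No callback, global variable or sleep loop is needed in that case.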

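【Solution 2】:

Adding to @Ingar Pedersen's answer: it is good to know that although a finished job reports the state DONE, it may still have ended with errors. So with that in mind: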

while(query_job.state != 'DONE'):
    time.sleep(60)
    query_job = client.get_job(query_job.job_id, location=query_job.location)

# Check for errors only after the final refresh, so a job that just
# reached DONE with a failure is not missed.
if query_job.errors is not None:
    raise Exception("BigQuery job failed with error {}".format(query_job.errors))

https://cloud.google.com/bigquery/docs/managing-jobs#bq
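
As far as I understand the job status fields, error_result is the job-level error that marks a finished job as failed, while errors lists individual errors seen during the run and does not by itself mean the job was unsuccessful. A check based on that assumption could look like the sketch below; raise_if_failed is a hypothetical helper name, and the "message" key is assumed to be present in the error mapping.

```python
def raise_if_failed(job):
    """Raise if a finished job reports a final error, else return the job.

    `job` is expected to behave like a google.cloud.bigquery job object.
    This helper is a hypothetical illustration, not a library function.
    """
    if job.state != "DONE":
        raise ValueError(
            f"job {job.job_id} is not finished yet (state={job.state})"
        )
    # error_result is assumed to be None when the job succeeded.
    if job.error_result is not None:
        message = job.error_result.get("message", "unknown error")
        # errors may carry extra detail collected while the job ran.
        raise RuntimeError(
            f"BigQuery job {job.job_id} failed: {message}; details: {job.errors}"
        )
    return job
```

It is meant to be called once on the refreshed job after the polling loop has exited, for example raise_if_failed(query_job).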

【Comments】: