【Title】: BigQuery TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
【Posted】: 2021-05-15 01:49:06
【Problem Description】:

Environment details

  • OS type and version: 1.5.29-debian10
  • Python version: 3.7
  • google-cloud-bigquery version: 2.8.0

I am provisioning a Dataproc cluster that fetches data from BigQuery into a pandas DataFrame. As my data keeps growing I want to improve performance, and I have heard about using the BigQuery Storage client.

I have run into this same problem before and solved it by pinning google-cloud-bigquery to version 1.26.1. If I use that version, I get the following message.

/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/client.py:407: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
 "Cannot create BigQuery Storage client, the dependency " 

The code snippet executes, but slowly. If I do not pin the pip version, I get the error below.

Steps to reproduce

  1. Create a cluster on Dataproc:

gcloud dataproc clusters create testing-cluster \
  --region=europe-west1 \
  --zone=europe-west1-b \
  --master-machine-type n1-standard-16 \
  --single-node \
  --image-version 1.5-debian10 \
  --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
  --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq'

  2. Execute the following script on the cluster:
from google.cloud import bigquery

bqclient = bigquery.Client(project=project)
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("query_start", "STRING", "2021-02-09 00:00:00"),
        bigquery.ScalarQueryParameter("query_end", "STRING", "2021-02-09 23:59:59.99"),
    ]
)
df = bqclient.query(query, job_config=job_config).to_dataframe(create_bqstorage_client=True)
2021-02-11 10:10:14,069 - preprocessing logger initialized
2021-02-11 10:10:14,069 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
Traceback (most recent call last):
  File "/tmp/782503bcc80246258560a07d2179891f/immo_preprocessing-pageviews_kyero.py", line 104, in <module>
    df = bqclient.query(base_query, job_config=job_config).to_dataframe(create_bqstorage_client=True)
  File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 1333, in to_dataframe
    date_as_object=date_as_object,
  File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
    df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
  File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'

The same error occurs with the pandas-gbq version:

import pandas as pd

query_config = {
    'query': {
        'parameterMode': 'NAMED',
        'queryParameters': [
            {
                'name': 'query_start',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': '2021-02-09 00:00:00'}
            },
            {
                'name': 'query_end',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': '2021-02-09 23:59:59.99'}
            },
        ]
    }
}
df = pd.read_gbq(base_query, configuration=query_config, progress_bar_type='tqdm',
                 use_bqstorage_api=True)
2021-02-11 09:21:19,532 - preprocessing logger initialized
2021-02-11 09:21:19,532 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
started
Downloading: 100%|██████████| 3107858/3107858 [00:14<00:00, 207656.33rows/s]
Traceback (most recent call last):
  File "/tmp/1830d5bcf198440e9e030c8e42a1b870/immo_preprocessing-pageviews.py", line 98, in <module>
    use_bqstorage_api=True)
  File "/opt/conda/default/lib/python3.7/site-packages/pandas/io/gbq.py", line 193, in read_gbq
    **kwargs,
  File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 977, in read_gbq
    dtypes=dtypes,
  File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 536, in run_query
    user_dtypes=dtypes,
  File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 590, in _download_results
    **to_dataframe_kwargs
  File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
    df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
  File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'

https://github.com/googleapis/python-bigquery/issues/519

【Comments】:

    Tags: python pandas google-bigquery


    【Solution 1】:

    Dataproc installs pyarrow 0.15.0 by default, while the BigQuery Storage API requires a newer version. Manually pinning pyarrow to 3.0.0 at installation time solved the problem.

    That said, PySpark has a compatibility setting for pyarrow >= 0.15.0 (https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#apache-arrow-in-spark). I checked the Dataproc release notes, and that environment variable has been set by default since May 2020.
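
    To see why the pin matters: the `timestamp_as_object` keyword that BigQuery passes to `RecordBatch.to_pandas()` was, as far as I can tell, introduced around pyarrow 1.0.0, so any 0.x release (such as the 0.15.0 that Dataproc 1.5 ships) rejects it with the TypeError above. A minimal sketch of that version check (the helper name is illustrative, not part of any library):

    ```python
    # Sketch: does this pyarrow version accept the timestamp_as_object
    # keyword? Assumes the keyword appeared with the 1.0.0 release.
    def supports_timestamp_as_object(version: str) -> bool:
        major = int(version.split(".")[0])
        return major >= 1

    print(supports_timestamp_as_object("0.15.0"))  # pyarrow shipped by Dataproc 1.5
    print(supports_timestamp_as_object("3.0.0"))   # version pinned in the fix
    ```

    The first call returns False and the second True, matching why only the pinned cluster works.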

    【Discussion】:

    • Can confirm! pip install pyarrow==3.0.0 fixed it for me.
    【Solution 2】:

    @Sam answered the question, but I just want to mention the actionable commands:

    In a Jupyter notebook:

    !pip install pyarrow==3.0.0

    In your virtual environment:

    pip install pyarrow==3.0.0
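
    On Dataproc itself, the same pin can be applied at cluster-creation time by adding pyarrow to the PIP_PACKAGES metadata of the question's command. A sketch based on the command from the question (untested here; pin versions to taste):

    ```shell
    gcloud dataproc clusters create testing-cluster \
      --region=europe-west1 \
      --zone=europe-west1-b \
      --master-machine-type n1-standard-16 \
      --single-node \
      --image-version 1.5-debian10 \
      --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
      --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq pyarrow==3.0.0'
    ```

    This way the fix survives cluster re-creation instead of relying on a manual pip install afterwards.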

    【Discussion】:
