[Question Title]: Importing multiple CSV files from Google Cloud Bucket to Datalab
[Posted]: 2019-03-17 01:09:01
[Question]:

I'm trying to get the following code to import multiple csv files from a Google Cloud bucket into Datalab under Python 3.x:

import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

myBucket = storage.Bucket('some-bucket')
object_list = myBucket.objects(prefix='some-prefix')
df_list = []

for obj in object_list:
  %gcs read --object $obj.uri --variable data  
  df_list.append(pd.read_csv(BytesIO(data), compression='gzip'))

df = pd.concat(df_list, ignore_index=True)
df.head()

I get the following error at the start of the for loop:

TypeError: a bytes-like object is required, not 'str'

I've spent some time on this problem, but no luck! Any help would be appreciated!

Here is the whole traceback just in case:

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/utils/_http.py in __init__(self, status, content)
     49       else:
---> 50         error = json.loads(str(content, encoding='UTF-8'))['error']
     51       if 'errors' in error:

/usr/local/envs/py3env/lib/python3.5/json/__init__.py in loads(s, 
encoding, cls, object_hook, parse_float, parse_int, parse_constant, 
object_pairs_hook, **kw)
    318             parse_constant is None and object_pairs_hook is None and not kw):
--> 319         return _default_decoder.decode(s)
    320     if cls is None:

/usr/local/envs/py3env/lib/python3.5/json/decoder.py in decode(self, s, _w)
    338         """
--> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    340         end = _w(s, end).end()

/usr/local/envs/py3env/lib/python3.5/json/decoder.py in raw_decode(self, s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-9-6d51e52b6c6f> in <module>()
      7 df_list = []
      8 
----> 9 for obj in object_list:
     10   get_ipython().run_line_magic('gcs', 'read --object $obj.uri --variable data')
     11   df_list.append(pd.read_csv(BytesIO(data), compression='gzip'))

/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/utils/_iterator.py in __iter__(self)
     34     """Provides iterator functionality."""
     35     while self._first_page or (self._page_token is not None):
---> 36       items, next_page_token = self._retriever(self._page_token, self._count)
     37 
     38       self._page_token = next_page_token

/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/storage/_object.py in _retrieve_objects(self, page_token, _)
    319                                          page_token=page_token)
    320     except Exception as e:
--> 321       raise e
    322 
    323     objects = list_info.get('items', [])

/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/storage/_object.py in _retrieve_objects(self, page_token, _)
    317       list_info = self._api.objects_list(self._bucket,
    318                                          prefix=self._prefix, 
delimiter=self._delimiter,
--> 319                                          page_token=page_token)
    320     except Exception as e:
    321       raise e

/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/storage/_api.py in objects_list(self, bucket, prefix, delimiter, projection, versions, max_results, page_token)
    246 
    247     url = Api._ENDPOINT + (Api._OBJECT_PATH % (bucket, ''))
--> 248     return google.datalab.utils.Http.request(url, args=args, credentials=self._credentials)
    249 
    250   def objects_patch(self, bucket, key, info):

/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/utils/_http.py in request(url, args, data, headers, method, credentials, raw_response, stats)
    156           return json.loads(str(content, encoding='UTF-8'))
    157       else:
--> 158         raise RequestException(response.status, content)
    159     except ValueError:
    160       raise Exception('Failed to process HTTP response.')

/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/utils/_http.py in __init__(self, status, content)
     53       self.message += ': ' + error['message']
     54     except Exception:
---> 55       lines = content.split('\n') if isinstance(content, basestring) else []
     56       if lines:
     57         self.message += ': ' + lines[0]

TypeError: a bytes-like object is required, not 'str'
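The final `TypeError` in this traceback comes from the library's own error handler, not from the CSV data: at `_http.py` line 55, `content` is a `bytes` object but `content.split('\n')` passes a `str` separator, which Python 3 rejects. A minimal snippet (the `b'Not Found'` payload is illustrative only) reproduces that failure mode:

```python
# Reproduce the failure mode from _http.py: bytes.split() with a str separator.
content = b'Not Found'

try:
    content.split('\n')  # what the library does; fails on Python 3
except TypeError as exc:
    print(exc)  # a bytes-like object is required, not 'str'

# The Python 3 fix is a bytes separator (or decoding first):
lines = content.split(b'\n')
print(lines)  # [b'Not Found']
```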

[Comments]:

    Tags: python-3.x pandas google-cloud-platform google-cloud-storage google-cloud-datalab


    [Solution 1]:

    The BytesIO class requires its argument to be bytes, not a string:

    >>> from io import BytesIO
    >>> BytesIO(b'hi')
    <_io.BytesIO object at 0x1088aedb0>
    >>> BytesIO('hi')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: a bytes-like object is required, not 'str'
    

    If your data is a string rather than bytes, you should use:

    from io import StringIO
    pd.read_csv(StringIO(data))
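
    Either way, a small helper can dispatch on the type of `data` before handing it to pandas. A minimal sketch (the helper name and the sample payload are illustrative, not part of the original code):

```python
import gzip
from io import BytesIO, StringIO

import pandas as pd

def to_dataframe(data):
    """Wrap str data in StringIO and gzipped bytes in BytesIO for read_csv."""
    if isinstance(data, bytes):
        return pd.read_csv(BytesIO(data), compression='gzip')
    return pd.read_csv(StringIO(data))

# Illustrative payload: a gzipped CSV, like a .gz object read from GCS as bytes.
payload = gzip.compress(b'a,b\n1,2\n3,4\n')
print(to_dataframe(payload).shape)  # (2, 2)
```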
    

    [Discussion]:

    • I understand that, but I get that error before even entering the for loop, on the `for obj in object_list:` line! My understanding is that it's complaining about `obj`, but I don't know why!
    • Could you include the whole traceback in the question?
    • This looks like a problem with the pydatalab library; I'd suggest filing an issue at github.com/googledatalab/pydatalab/issues
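    Until that library issue is resolved, one workaround is to bypass pydatalab and its `%gcs` magic entirely and use the standard `google-cloud-storage` client instead. A sketch, assuming a reasonably recent version of that library is available in the environment (`Client.list_blobs` and `Blob.download_as_bytes` are its documented API); the function name and bucket/prefix values are illustrative:

```python
def load_csvs(bucket_name, prefix):
    """Read every gzipped CSV under `prefix` into one concatenated DataFrame."""
    from io import BytesIO

    import pandas as pd
    from google.cloud import storage  # standard GCS client, not pydatalab

    client = storage.Client()
    frames = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        # download_as_bytes() returns bytes, which BytesIO accepts directly
        frames.append(pd.read_csv(BytesIO(blob.download_as_bytes()),
                                  compression='gzip'))
    return pd.concat(frames, ignore_index=True)

# Usage (requires GCS credentials, so not run here):
# df = load_csvs('some-bucket', 'some-prefix')
```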