【Title】: Can't access Bluemix object store from my Notebook
【Posted】: 2016-05-05 23:24:45
【Question】:

I am trying to read a couple of JSON files from my Bluemix object store into a Jupyter notebook using Python. I have followed the examples I found, but I am still getting a "No such file or directory" error.

Here is the code that authenticates to the object store and identifies the files:

# Set up Spark
from pyspark import SparkContext
from pyspark import SparkConf

if('config' not in globals()):
    config = SparkConf().setAppName('warehousing_sql').setMaster('local')
if('sc' not in globals()):
    sc = SparkContext(conf=config)

# Set the Hadoop configuration.
def set_hadoop_config(name, credentials):
    prefix = "fs.swift.service." + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials['auth_url']+'/v3/auth/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials['project_id'])
    hconf.set(prefix + ".username", credentials['user_id'])
    hconf.set(prefix + ".password", credentials['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials['region'])
    hconf.setBoolean(prefix + ".public", True)

# Data Sources (generated by Insert to code)
credentials = {
  'auth_url':'https://identity.open.softlayer.com',
  'project':'***',
  'project_id':'****',
  'region':'dallas',
  'user_id':'****',
  'domain_id':'****',
  'domain_name':'****',
  'username':'****',
  'password':"""****""",
  'filename':'Warehousing-data.json',
  'container':'notebooks',
  'tenantId':'****'
}

set_hadoop_config('spark', credentials)

# The data files should now be accessible through URLs of the form
# swift://notebooks.spark/filename.json
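
The swift:// URL is assembled from the container name and the service name that was passed to set_hadoop_config, which is easy to get wrong. A minimal sketch of how the pieces line up, using the names from the code above:

container = credentials['container']   # 'notebooks'
service_name = 'spark'                 # first argument passed to set_hadoop_config
url = "swift://{0}.{1}/{2}".format(container, service_name, credentials['filename'])
print(url)                             # swift://notebooks.spark/Warehousing-data.json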

Here is the calling code:

...
resource_path= "swift://notebooks.spark/"
Warehousing_data_json = "Warehousing-data.json"
Warehousing_sales_data_nominal_scenario_json = "Warehousing-sales_data-nominal_scenario.json"
...

Here is the error: IOError: [Errno 2] No such file or directory: 'swift://notebooks.spark/Warehousing-data.json'

I apologize if this seems like a newbie question (I admit I am one), but I find the setup for this to be remarkably complicated, and it seems like very bad form that it relies on the undocumented method SparkContext._jsc.hadoopConfiguration().


Added in response to Hobert's and Sven's comments:

Thanks, Hobert. I do not understand your comment about the definition of "swift://notebooks**.spark**/". Unless I misunderstand the logic of the example I followed (which is essentially the same as what Sven shows in his reply), this path comes out of the call to sc._jsc.hadoopConfiguration(), but it is hard to know what that call actually does, since the HadoopConfiguration class is not documented.
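
For what it's worth, sc._jsc appears to be the py4j handle to the Java SparkContext, and hadoopConfiguration() returns a Hadoop Configuration object, so the keys written by set_hadoop_config can at least be read back for inspection. A minimal sketch, assuming the 'spark' service name used above:

# Read back what set_hadoop_config wrote (sketch; assumes service name 'spark')
hconf = sc._jsc.hadoopConfiguration()
for key in ["auth.url", "tenant", "username", "region"]:
    print(key + " = " + str(hconf.get("fs.swift.service.spark." + key)))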

I also do not understand the two alternatives you offer, "use/add that definition for the Hadoop configuration" or "alternatively ... use the swift client inside Spark to access the JSON." I think I would prefer the latter, since I am not otherwise using Hadoop in my notebook. Please point me to a more detailed explanation of these alternatives.

Thanks, Sven. You are right that I did not show the actual reading of the JSON files. The reading actually happens inside a method that is part of the DOcplexcloud API. Here is the relevant code from my notebook:

resource_path= "swift://notebooks.spark/"
Warehousing_data_json = "Warehousing-data.json"
Warehousing_sales_data_nominal_scenario_json = "Warehousing-sales_data-nominal_scenario.json"

resp = client.execute(input= [{'name': "warehousing.mod",
                               'file': StringIO(warehousing_data_dotmod + warehousing_inputs + warehousing_dotmod + warehousing_outputs)},
                              {'name': Warehousing_data_json,
                               'filename': resource_path + Warehousing_data_json},
                              {'name': Warehousing_sales_data_nominal_scenario_json,
                               'filename': resource_path + Warehousing_sales_data_nominal_scenario_json}],
                      output= "results.json",
                      load_solution= True,
                      log= "solver.log",
                      gzip= True,
                      waittime= 300,
                      delete_on_completion= True)

Here is the stack trace:

IOError                                   Traceback (most recent call last)
<ipython-input-8-67cf709788b3> in <module>()
     29                       gzip= True,
     30                       waittime= 300,
---> 31                       delete_on_completion= True)
     32 
     33 result = WarehousingResult(json.loads(resp.solution.decode("utf-8")))

/gpfs/fs01/user/sbf1-4c17d3407da8d0-a7ea98a5cc6d/.local/lib/python2.7/site-packages/docloud/job.pyc in execute(self, input, output, load_solution, log, delete_on_completion, timeout, waittime, gzip, parameters)
    496         # submit job
    497         jobid = self.submit(input=input, timeout=timeout, gzip=gzip,
--> 498                             parameters=parameters)
    499         response = None
    500         completed = False

/gpfs/fs01/user/sbf1-4c17d3407da8d0-a7ea98a5cc6d/.local/lib/python2.7/site-packages/docloud/job.pyc in submit(self, input, timeout, gzip, parameters)
    436                                 gzip=gzip,
    437                                 timeout=timeout,
--> 438                                 parameters=parameters)
    439         # run model
    440         self.execute_job(jobid, timeout=timeout)

/gpfs/fs01/user/sbf1-4c17d3407da8d0-a7ea98a5cc6d/.local/lib/python2.7/site-packages/docloud/job.pyc in create_job(self, **kwargs)
    620                 self.upload_job_attachment(job_id, 
    621                                            attid=inp.name,
--> 622                                            data=inp.get_data(),
    623                                            gzip=gzip)
    624         return job_id

/gpfs/fs01/user/sbf1-4c17d3407da8d0-a7ea98a5cc6d/.local/lib/python2.7/site-packages/docloud/job.pyc in get_data(self)
    110         data = self.data
    111         if self.filename is not None:
--> 112             with open(self.filename, "rb") as f:
    113                 data = f.read()
    114         if self.file is not None:

IOError: [Errno 2] No such file or directory: 'swift://notebooks.spark/Warehousing-data.json'

When I run this notebook locally it works just fine, with resource_path being a path on my own machine.

Sven, your code looks practically identical to mine, and it closely resembles the example I copied from, so I do not understand why yours works and mine does not.

I have verified that the files exist in my Instance_objectstore. Therefore it seems that swift://notebooks.spark/ does not point to that object store. Why it doesn't has been a mystery to me from the start. Again, the HadoopConfiguration class is not documented, so there is no way to know how it establishes the correspondence between the URL and the object store.
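
In hindsight, the likely explanation (my reading, not confirmed by documentation) is that swift:// is a Hadoop filesystem scheme that only Spark's own readers know how to resolve; Python's built-in open(), which is what DOcplexcloud's get_data() ultimately calls on a 'filename', knows nothing about it. A minimal sketch of the difference, assuming the configuration above:

# Spark's readers resolve swift:// through the Hadoop configuration:
rdd = sc.textFile("swift://notebooks.spark/Warehousing-data.json")
print(rdd.take(1))

# Plain Python I/O does not understand the swift:// scheme, and this is
# exactly what DOcplexcloud does internally with 'filename', hence the error:
# open("swift://notebooks.spark/Warehousing-data.json", "rb")  # IOError: [Errno 2]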

【Comments】:

  • Hey J. Bloom, good question! Looking at your resource path, I do not see a definition of "swift://notebooks**.spark**/" in the code above. I could try to rework the code to use/add that definition for the Hadoop configuration. Alternatively, you could also use the swift client inside Spark to access the JSON. Which approach would you like to take?
  • In your calling code, the lines that actually read the JSON files are missing, e.g. sc.textFile() or sqlContext.read.json(). Could you add those lines as well as the first part of the stack trace?

Tags: python ibm-cloud pyspark jupyter-notebook


【Solution 1】:

The error message IOError: [Errno 2] No such file or directory: 'swift://notebooks.spark/Warehousing-data.json' means that there is no such file at that path. I think the Hadoop configuration itself was set successfully, because otherwise you would get a different error complaining about missing credential settings.

I tested the following code in a Python notebook on Bluemix and it worked for me. I took the sample code from the latest sample notebook showing how to load data from Bluemix Object Storage V3.

The method that sets the Hadoop configuration:

def set_hadoop_config(credentials):
    """This function sets the Hadoop configuration with given credentials, 
    so it is possible to access data using SparkContext"""

    prefix = "fs.swift.service." + credentials['name']
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials['auth_url']+'/v3/auth/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials['project_id'])
    hconf.set(prefix + ".username", credentials['user_id'])
    hconf.set(prefix + ".password", credentials['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials['region'])
    hconf.setBoolean(prefix + ".public", True)

Insert the credentials of the associated Bluemix Object Storage V3 (via Insert to code):

credentials_1 = {
  'auth_url':'https://identity.open.softlayer.com',
  'project':'***',
  'project_id':'***',
  'region':'dallas',
  'user_id':'***',
  'domain_id':'***',
  'domain_name':'***',
  'username':'***',
  'password':"""***""",
  'filename':'people.json',
  'container':'notebooks',
  'tenantId':'***'
}

Set the Hadoop configuration with the given credentials:

credentials_1['name'] = 'spark'
set_hadoop_config(credentials_1)

Read the JSON file into an RDD using sc.textFile() and print out the first 3 rows:

data_rdd = sc.textFile("swift://" + credentials_1['container'] + "." + credentials_1['name'] + "/" + credentials_1['filename'])
data_rdd.take(3)

Output:

[u'{"name":"Michael"}',
 u'{"name":"Andy", "age":30}',
 u'{"name":"Justin", "age":19}']

Read the JSON file into a DataFrame using sqlContext.read.json() and output the first 3 rows:

data_df = sqlContext.read.json("swift://" + credentials_1['container'] + "." + credentials_1['name'] + "/" + credentials_1['filename'])
data_df.take(3)

Output:

[Row(age=None, name=u'Michael'),
 Row(age=30, name=u'Andy'),
 Row(age=19, name=u'Justin')]
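
If, as in the question, the file then has to be handed to an API that expects a local path (the DOcplexcloud client opens 'filename' with plain open()), one option is to materialize the contents on the notebook's local filesystem first. This is a sketch under that assumption, not part of the original sample; /tmp is a hypothetical scratch location:

# Write the object-store file to local disk so that plain open() can read it.
local_path = "/tmp/" + credentials_1['filename']
with open(local_path, "w") as f:
    for line in data_rdd.collect():   # collect() is fine only for small files
        f.write(line + "\n")

local_path can then be passed as the 'filename' in client.execute().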

【Comments】:

【Solution 2】:

I found a better solution at https://developer.ibm.com/recipes/tutorials/using-ibm-object-storage-in-bluemix-with-python/; the sample code is at https://github.com/saviosaldanha/IBM_Object_Store_Python_Example/blob/master/storage_recipe_example.py

Here is the revised code:

import swiftclient
# The recipe also imports keystoneclient; note that `from keystoneclient
# import client` would shadow the DOcplexcloud `client` used further below,
# so import the package itself instead.
import keystoneclient

# Object Store credentials (generated by Insert to code)
credentials = {
  'auth_url':'https://identity.open.softlayer.com',
  'project':'***',
  'project_id':'***',
  'region':'dallas',
  'user_id':'***',
  'domain_id':'***',
  'domain_name':'***',
  'username':'***',
  'password':"""***""",
  'filename':'Warehousing-data.json',
  'container':'notebooks',
  'tenantId':'***'
}

# Establish a connection to the Bluemix Object Store; the credential
# keys must be quoted strings, not bare names.
connection = swiftclient.Connection(
    key=credentials['password'],
    authurl=credentials['auth_url'],
    auth_version='3',
    os_options={"project_id": credentials['project_id'],
                "user_id": credentials['user_id'],
                "region_name": credentials['region']})

# The data files should now be accessible through calls of the form
# connection.get_object(credentials['container'], fileName)[1]
    

The files are then accessed as follows:

Warehousing_data_json = "Warehousing-data.json"
Warehousing_sales_data_nominal_scenario_json = "Warehousing-sales_data-nominal_scenario.json"

# get_object() returns a (headers, content) tuple, so [1] is the file's
# content, not a path; wrap it in a file-like object and pass it as 'file',
# because anything passed as 'filename' is handed to open() by docloud.
resp = client.execute(input= [{'name': "warehousing.mod",
                               'file': StringIO(warehousing_data_dotmod + warehousing_inputs + warehousing_dotmod + warehousing_outputs)},
                              {'name': Warehousing_data_json,
                               'file': StringIO(connection.get_object(credentials['container'], Warehousing_data_json)[1])},
                              {'name': Warehousing_sales_data_nominal_scenario_json,
                               'file': StringIO(connection.get_object(credentials['container'], Warehousing_sales_data_nominal_scenario_json)[1])}],
                      output= "results.json",
                      load_solution= True,
                      log= "solver.log",
                      gzip= True,
                      waittime= 300,
                      delete_on_completion= True)
    

The remaining problem is how to load the swiftclient and keystoneclient libraries in Bluemix. Pip does not seem to work in the notebook. Does anyone know how to handle this?
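
A guess, not verified in this particular environment: Bluemix notebooks generally allow shell commands from a cell, in which case a user-level install may work (restart the kernel afterwards so the new packages are picked up):

!pip install --user python-swiftclient python-keystoneclient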

【Comments】:
