【Question title】: Scrape data from Bloomberg
【Posted】: 2019-09-23 14:17:52
【Question】:

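I want to scrape data from the Bloomberg website. Specifically, I need the data shown under "IBVC:IND Caracas Stock Exchange Stock Market Index".

This is my code so far: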

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.36 '
}
res = requests.get("https://www.bloomberg.com/quote/IBVC:IND", headers=headers)

soup = bs(res.content, 'html.parser')
# print(soup)
items = soup.find("div", {"class": "snapshot__0569338b snapshot"})

open_ = items.find("span", {"class": "priceText__1853e8a5"}).text
print(open_)
prev_close = items.find("span", {"class": "priceText__1853e8a5"}).text

I cannot find the required values in the HTML. Which library should I use to handle this? I am currently using BeautifulSoup and Requests.

【Comments】:

    Tags: python web-scraping beautifulsoup


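    【Solution 1】:

    As the other answers point out, the content is generated by JavaScript and is therefore not present in the plain HTML. Two different angles of attack have been proposed for this problem:

    • Selenium, aka The Big Guns: this lets you automate almost any task in a browser, at some cost in speed.
    • API Request, aka The Deliberate Approach: this is not always feasible, but when it is, it is far more efficient.

    I will elaborate on the second one. @ViniciusDAvila has already laid out the typical blueprint for this kind of solution: navigate to the site, inspect the Network tab, and identify which request is responsible for fetching the data.

    Once that is done, the rest is a matter of execution:

    Scraper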

    import requests
    from urllib.parse import quote
    
    
    # Constants
    HEADERS = {
        'Host': 'www.bloomberg.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
        'Accept': '*/*',
        'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
        # 'br' is left out: requests only decodes Brotli when the brotli package is installed
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'https://www.bloomberg.com/quote/',
        'DNT': '1',
        'Connection': 'keep-alive',
        'TE': 'Trailers'
    }
    URL_ROOT = 'https://www.bloomberg.com/markets2/api/datastrip'
    URL_PARAMS = 'locale=en&customTickerList=true'
    VALID_TYPE = {'currency', 'index'}
    
    
    # Scraper
    def scraper(object_id: str, object_type: str, timeout: int = 5) -> list:
        """
        Get the Bloomberg data for the given object.
        :param object_id: The Bloomberg identifier of the object.
        :param object_type: The type of the object. (Currency or Index)
        :param timeout: Maximal number of seconds to wait for a response.
        :return: The parsed JSON response (a list of dictionaries), or an empty list on failure.
        """
        object_type = object_type.lower()
        if object_type not in VALID_TYPE:
            return list()
        # Build headers and url
        object_append = '%s:%s' % (object_id, 'IND' if object_type == 'index' else 'CUR')
        # Copy HEADERS so repeated calls do not keep appending to the shared Referer
        headers = dict(HEADERS)
        headers['Referer'] += object_append
        url = '%s/%s?%s' % (URL_ROOT, quote(object_append), URL_PARAMS)
        # Make the request and check response status code
        response = requests.get(url=url, headers=headers, timeout=timeout)
        if 200 <= response.status_code < 300:
            return response.json()
        return list()
    

    Test

    # Index
    object_id, object_type = 'IBVC', 'index'
    data = scraper(object_id=object_id, object_type=object_type)
    print('The open price for %s %s is: %d' % (object_type, object_id, data[0]['openPrice']))
    # The open price for index IBVC is: 50094
    
    # Exchange rate
    object_id, object_type = 'EUR', 'currency'
    data = scraper(object_id=object_id, object_type=object_type)
    print('The open exchange rate for USD per {} is: {}'.format(object_id, data[0]['openPrice']))
    # The open exchange rate for USD per EUR is: 1.0993
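
Because `scraper` returns an empty list when the request fails, indexing `data[0]` as in the test above would raise an `IndexError` in that case. A minimal sketch of a defensive accessor follows. `first_value` is a hypothetical helper name, not part of the original answer, and the only field name taken from the answer is `openPrice`; the sample payloads are illustrative.

```python
from typing import Any


def first_value(data: list, key: str, default: Any = None) -> Any:
    """Return data[0][key], or `default` if the list is empty or the key is missing."""
    if not data or not isinstance(data[0], dict):
        return default
    return data[0].get(key, default)


# Hand-written payloads shaped like the list indexed in the test above.
print(first_value([{'openPrice': 50094}], 'openPrice'))  # 50094
print(first_value([], 'openPrice', default='n/a'))       # n/a
```

With this helper the caller can tell "no data" apart from a real value without wrapping every lookup in a try/except.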
    

    【Comments】:

      【Solution 2】:

      The required values are loaded dynamically, so in this case you can try combining selenium with BeautifulSoup. Here is some sample code for reference:

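      import os
      import time
      from selenium import webdriver
      from selenium.webdriver.chrome.service import Service
      from bs4 import BeautifulSoup
      
      # put the driver in the folder of this code
      # (Selenium 4 takes the driver path through a Service object)
      driver = webdriver.Chrome(service=Service(os.path.join(os.getcwd(), 'chromedriver')))
      
      try:
          driver.get("https://www.bloomberg.com/quote/IBVC:IND")
          time.sleep(3)  # crude wait for the JavaScript-rendered content
          real_soup = BeautifulSoup(driver.page_source, 'html.parser')
          open_ = real_soup.find("span", {"class": "priceText__1853e8a5"}).text
          print(f"Price: {open_}")
      finally:
          # always close the browser, even if the element was not found
          driver.quit()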
      

      Output:

      Price: 50,083.00
      

      You can search for chromedriver and download the build that matches your Chrome version.

      【Comments】:

        【Solution 3】:
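
The value scraped above is a formatted string ("50,083.00"), not a number. A minimal sketch of converting it for further calculations follows. `parse_price` is a hypothetical helper name, and it assumes `,` is the thousands separator and `.` the decimal point, as in the output shown.

```python
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Convert a price string such as '50,083.00' to a Decimal."""
    # Assumption: ',' separates thousands and '.' marks the decimals.
    return Decimal(text.strip().replace(',', ''))


print(parse_price('50,083.00'))  # 50083.00
```

`Decimal` is used instead of `float` so the two decimal places of the quoted price are preserved exactly.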

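        Since this is not a static page, you need to make a request to the Bloomberg API. To find out how, open the page, choose "Inspect element" and select the "Network" tab, then filter by "XHR" and look for responses of type JSON. Reload the page. I did this and believe this is what you are looking for: link

        【Comments】: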
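
The manual search through the Network tab can also be done offline, assuming you export the recorded traffic as a HAR file (browsers offer this from the Network tab). The sketch below lists the URLs of all entries whose response MIME type mentions JSON. `json_request_urls` is a hypothetical helper name, and the sample data is hand-written with placeholder URLs, not real Bloomberg traffic.

```python
import json


def json_request_urls(har: dict) -> list:
    """Return the URLs of all HAR entries whose response MIME type mentions JSON."""
    urls = []
    for entry in har.get('log', {}).get('entries', []):
        mime = entry.get('response', {}).get('content', {}).get('mimeType', '')
        if 'json' in mime.lower():
            urls.append(entry['request']['url'])
    return urls


# Tiny hand-written HAR fragment; a real export would be read from disk with json.load().
sample_har = json.loads("""
{"log": {"entries": [
  {"request": {"url": "https://example.com/page.html"},
   "response": {"content": {"mimeType": "text/html"}}},
  {"request": {"url": "https://example.com/api/data"},
   "response": {"content": {"mimeType": "application/json; charset=utf-8"}}}
]}}
""")
print(json_request_urls(sample_har))  # ['https://example.com/api/data']
```

Each URL it returns is a candidate endpoint to try with `requests`, as in Solution 1.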
