【Question title】: Extract html div class using BeautifulSoup
【Posted】: 2018-06-06 16:15:59
【Question】:

I want to get "8.0" from the HTML below:

<div class="js-otelpuani" style="float: left;"> ==$0
 "8.0"
 <span class="greyish" style="font-size:13px; font-family:arial;"> /10</span>
 ::after
</div>

I have tried using the code below to extract '8.0' from the div with class='js-otelpuani', but it does not seem to work:

import urllib
import requests
from bs4 import BeautifulSoup
import pyodbc

headers = {
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"accept-charset": "cp1254,ISO-8859-9,utf-8;q=0.7,*;q=0.3",
"accept-encoding": "gzip,deflate,sdch",
"accept-language": "tr,tr-TR,en-US,en;q=0.8",
}
r = requests.get('https://www.otelz.com/otel/elvin-deluxehotel#.WkDIBd9l_IU', headers=headers)
if r.status_code != 200:
    print("request denied")
else:
    print("ok")
    soup = BeautifulSoup(r.text) 
    score = soup.find('div',attrs={'class': 'js-otelpuani'})
    print(score)

I get the following output, but unfortunately it does not contain the "8.0" value I want to extract:

ok
<div class="js-otelpuani" style="float: left;">
<span id="comRatingValue">.0</span>
<span class="greyish" style="font-size: 13px; font-family: arial;">
/
<span itemprop="bestRating">10</span></span>
<span id="comRatingCount" itemprop="ratingCount" style="display: none;">0</span>
<span id="comReviewCount" itemprop="reviewCount" style="display: none;">0</span>
</div>

I would appreciate any help!

【Comments】:

    Tags: python web-scraping beautifulsoup python-requests python-3.6


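    【Solution 1】:

    If you inspect the page's HTML source and search for js-otelpuani, you will notice that it is also used inside a script tag. If you follow that script's logic, you will see that the rating itself is produced by a separate request to the GeneralPartial/Degerlendirmeler/8974 endpoint, where 8974 is the hotel ID.

    Let's mimic this exact logic in your script: first extract the hotel ID, then make a separate request and extract the rating value: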

    import requests
    
    from bs4 import BeautifulSoup
    
    
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "cp1254,ISO-8859-9,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "tr,tr-TR,en-US,en;q=0.8",
    }
    
    with requests.Session() as session:
        # merge into the session defaults so every request sends these headers
        session.headers.update(headers)
    
        r = session.get('https://www.otelz.com/otel/elvin-deluxehotel#.WkDIBd9l_IU')
        if r.status_code != 200:
            print("request denied")
        else:
            print("ok")
            soup = BeautifulSoup(r.text, "html.parser")
    
            # get the hotel id
            hotel_id = soup.find(attrs={"data-hotelid": True})["data-hotelid"]
    
            # go for the hotel rating
            response = session.get("https://www.otelz.com/GeneralPartial/Degerlendirmeler/{hotel_id}".format(hotel_id=hotel_id))
            soup = BeautifulSoup(response.text, "html.parser")
    
            rating_value = soup.find(attrs={'data-rating-value': True})['data-rating-value']
            print(rating_value)
    

    Prints:

    8.0
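The attribute lookup used in both steps can be tried offline. The following is a minimal sketch, assuming hypothetical markup: the two HTML strings and the helper name `first_attr_value` are made up for illustration, and the real pages may place `data-hotelid` and `data-rating-value` on different elements. Passing `True` as the attribute value asks BeautifulSoup for the first tag that has the attribute at all, whatever its value.

```python
from bs4 import BeautifulSoup


def first_attr_value(html, attr_name):
    """Return the value of attr_name on the first tag that carries it, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(attrs={attr_name: True})  # True matches any value
    return tag[attr_name] if tag is not None else None


# Hypothetical stand-ins for the two responses; not the site's real markup.
hotel_page = '<div class="hotel" data-hotelid="8974">Hotel</div>'
rating_partial = '<div class="rating" data-rating-value="8.0"></div>'

hotel_id = first_attr_value(hotel_page, "data-hotelid")
rating_url = "https://www.otelz.com/GeneralPartial/Degerlendirmeler/{}".format(hotel_id)
rating_value = first_attr_value(rating_partial, "data-rating-value")

print(hotel_id)
print(rating_url)
print(rating_value)
```

Returning `None` when the attribute is missing avoids the `TypeError` that subscripting a failed `find()` result would raise, which can help if the site changes its markup.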
    

    【Comments】:

      【Solution 2】:

      You should probably use something like this:

      soup.find('div', {'class' :'js-otelpuani'}).text
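To see what `.text` returns here, the sketch below runs it against the static markup copied from the output shown in the question, so no network access is needed. It is only an illustration under that assumption: `.text` (like `get_text()`) concatenates the text of every descendant, so on this markup it can only return what the server sent, which is the ".0" placeholder rather than "8.0".

```python
from bs4 import BeautifulSoup

# Static markup copied from the output shown in the question.
STATIC_HTML = """
<div class="js-otelpuani" style="float: left;">
<span id="comRatingValue">.0</span>
<span class="greyish" style="font-size: 13px; font-family: arial;">
/
<span itemprop="bestRating">10</span></span>
<span id="comRatingCount" itemprop="ratingCount" style="display: none;">0</span>
<span id="comReviewCount" itemprop="reviewCount" style="display: none;">0</span>
</div>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")
div = soup.find('div', {'class': 'js-otelpuani'})

# All descendant text, whitespace-stripped and joined with single spaces.
full_text = div.get_text(" ", strip=True)

# Only the rating span, selected by its id.
rating_text = div.find(id="comRatingValue").get_text(strip=True)

print(full_text)
print(rating_text)
```

This approach therefore only yields the score when the HTML being parsed already contains it, for example a browser-rendered page source.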
      

      【Comments】:

        【Solution 3】:

        If you are open to using selenium, the data you are after can be parsed easily, as follows:

        from bs4 import BeautifulSoup
        from selenium import webdriver
        
        driver = webdriver.Chrome()
        try:
            driver.get('https://www.otelz.com/otel/elvin-deluxehotel#.WkDf39KWa1t')
            # page_source reflects the page as rendered by the browser
            soup = BeautifulSoup(driver.page_source, "lxml")
        finally:
            driver.quit()
        
        for item in soup.select('.js-otelpuani'):
            # remove the nested <span> elements so only the score text remains
            for elem in item("span"):
                elem.extract()
            print(item.text.strip())
        

        Output:

        8.0
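The span-stripping step can also be done without modifying the parsed tree. The sketch below is one possible variant, not part of the answer above: it assumes the rendered markup looks like the DevTools snippet in the question (with the score as a bare text node inside the div), and the helper name `own_text` is made up for illustration. It keeps only the text nodes that are direct children of the tag.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumed rendered markup, modelled on the DevTools snippet in the question.
RENDERED_HTML = """
<div class="js-otelpuani" style="float: left;">
 8.0
 <span class="greyish" style="font-size:13px; font-family:arial;"> /10</span>
</div>
"""


def own_text(tag):
    """Join the text nodes that are direct children of tag, ignoring nested tags."""
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()


soup = BeautifulSoup(RENDERED_HTML, "html.parser")
score = own_text(soup.select_one(".js-otelpuani"))
print(score)
```

Because nothing is extracted, the `/10` span stays available in `soup` if it is needed later.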
        

        【Comments】:
