【Question Title】: Python 3 BeautifulSoup: Scraping Content After "Read More" Text
【Posted】: 2021-06-26 02:26:39
【Question】:

I've recently started looking into buying some land, and I'm writing a small application to help me organize the details in Jira/Confluence, so I can keep track of who I've spoken to and what I discussed with them for each individual parcel.

So, I wrote this little scraper for landwatch(dot)com:

[the url is just a listing on the site]

from bs4 import BeautifulSoup
import requests


def get_property_data(url):
    headers = ({'User-Agent':
                    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = requests.get(url, headers=headers)  # Maybe request Url with read more already gone
    soup = BeautifulSoup(response.text, 'html5lib')
    title = soup.find_all(class_='b442a')[0].text
    details = soup.find_all('p', class_='d19de')
    price = soup.find_all('div', class_='_260f0')[0].text
    # Collect the non-empty detail paragraphs and wrap each in <p> tags
    deets = [d.text for d in details if d.text != '']
    detail = ''.join('<p>' + d + '</p>' for d in deets)
    return [title, detail, price]

Everything works great, except that the d19de class hides a large number of its values behind a Read More button.

While googling around, I found How to Scrape reviews with read more from Webpages using BeautifulSoup, but either I don't understand their implementation well enough, or it no longer works:

import requests ; from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.mouthshut.com/product-reviews/Lakeside-Chalet-Mumbai-reviews-925017044").text, "html.parser")
for title in soup.select("a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_]"):
    items = title.get('href')
    if items:
        broth = BeautifulSoup(requests.get(items).text, "html.parser")
        for item in broth.select("div.user-review p.lnhgt"):
            print(item.text)

Any ideas on how to get around the Read More button? I'd really like to do this in BeautifulSoup rather than selenium.

Here's a sample URL for testing: https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403

【Comments】:

    Tags: python python-3.x web-scraping beautifulsoup


    【Solution 1】:

    That data is present in a script tag. Here's an example of extracting it, parsing it with json, and getting the land-description info out as a list:

    from bs4 import BeautifulSoup
    import requests, json
    
    url = 'https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403'
    headers = ({'User-Agent':
                        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = requests.get(url, headers=headers)  # Maybe request Url with read more already gone
    soup = BeautifulSoup(response.text, 'html5lib')
    
    all_data = json.loads(soup.select_one('[type="application/ld+json"]').string)
    details = all_data['description'].split('\r\r') 
    

    You may also want to inspect the rest of the content in that script tag:

    from pprint import pprint
    
    pprint(all_data)
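    To see why the one-liner above works without any network access, here is a minimal, self-contained sketch against a stand-in HTML snippet. The JSON-LD field names used here (name, description, offers.price) are assumptions modeled on common schema.org markup, not the site's guaranteed payload:

    ```python
    import json
    from bs4 import BeautifulSoup

    # Stand-in for a listing page: the full description lives in a
    # <script type="application/ld+json"> tag, untouched by the
    # "Read More" toggle, with paragraphs separated by \r\r.
    html = """
    <html><head>
    <script type="application/ld+json">
    {"name": "40 acres in Huerfano County",
     "description": "First paragraph.\\r\\rSecond paragraph.",
     "offers": {"price": "25000"}}
    </script>
    </head><body></body></html>
    """

    soup = BeautifulSoup(html, 'html.parser')
    # Grab the JSON-LD block and parse it as ordinary JSON
    all_data = json.loads(soup.select_one('[type="application/ld+json"]').string)
    details = all_data['description'].split('\r\r')
    print(details)  # ['First paragraph.', 'Second paragraph.']
    ```

    The same select_one/json.loads pattern is what the answer applies to the live page; only the surrounding requests call differs.
    
    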
    

    【Discussion】:
