【Posted】: 2021-06-26 02:26:39
【Question】:
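I've recently started looking into buying some land, and I'm writing a small app to help me organize the details in Jira/Confluence, so I can keep track of who I've talked to and what we discussed for each parcel.
So I wrote this little scraper for landwatch(dot)com:
[url is just a listing on the site]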
from bs4 import BeautifulSoup
import requests

def get_property_data(url):
    # Send a browser-like User-Agent with the request.
    headers = {'User-Agent':
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
    response = requests.get(url, headers=headers)  # Maybe request URL with Read More already gone
    soup = BeautifulSoup(response.text, 'html5lib')
    title = soup.find_all(class_='b442a')[0].text
    details = soup.find_all('p', class_='d19de')
    price = soup.find_all('div', class_='_260f0')[0].text
    # Keep only the non-empty detail paragraphs.
    deets = [d.text for d in details if d.text != '']
    # Wrap each detail in <p> tags and join them into one string.
    detail = ''.join('<p>' + d + '</p>' for d in deets)
    return [title, detail, price]
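Everything works fine, except that the d19de class has a lot of its values hidden behind a Read More button.
While googling, I found "How to Scrape reviews with read more from Webpages using BeautifulSoup", but either I don't understand what they're doing well enough to implement it, or it no longer works: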
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.mouthshut.com/product-reviews/Lakeside-Chalet-Mumbai-reviews-925017044").text, "html.parser")
# Collect the href of each matching review link.
for title in soup.select("a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_]"):
    items = title.get('href')
    if items:
        # Fetch the linked page and print the review paragraphs found there.
        broth = BeautifulSoup(requests.get(items).text, "html.parser")
        for item in broth.select("div.user-review p.lnhgt"):
            print(item.text)
Any ideas on how to get around the Read More button? I'd really like to do this in BeautifulSoup rather than selenium.
Here is a sample URL for testing: https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403
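One check that may help narrow this down is to see whether the hidden text is in the downloaded HTML at all, and if so, in which tag. The sketch below is a diagnostic only, not a fix. The names `TextLocator`, `locate_text`, and `SAMPLE_HTML` are made up for illustration, and the sample markup is invented: I have not verified how landwatch.com actually delivers the text behind Read More. It uses only the standard library so it runs without extra installs; the same idea should carry over to BeautifulSoup by searching the `<script>` tags as well as the `<p>` tags.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextLocator(HTMLParser):
    """Record the enclosing tag of every text chunk that contains a snippet."""

    def __init__(self, snippet):
        super().__init__()
        self.snippet = snippet
        self.stack = []  # currently open tags as (name, attrs-dict)
        self.hits = []   # enclosing (name, attrs-dict) for each match

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Script bodies arrive here too, so embedded JSON is searched as well.
        # Caveat: a snippet split across two data chunks would be missed.
        if self.snippet in data and self.stack:
            self.hits.append(self.stack[-1])


def locate_text(html, snippet):
    """Return the (tag, attrs) pairs whose text contains `snippet`."""
    parser = TextLocator(snippet)
    parser.feed(html)
    parser.close()
    return parser.hits


# Invented example: a short teaser in a <p>, the full text in embedded JSON.
SAMPLE_HTML = (
    '<html><body>'
    '<p class="d19de">Short teaser</p>'
    '<script type="application/ld+json">'
    '{"description": "Full hidden text"}'
    '</script>'
    '</body></html>'
)

print(locate_text(SAMPLE_HTML, "Full hidden text"))
```

To try it on the real page, pass `response.text` from `get_property_data` together with a phrase that only appears after clicking Read More in the browser. If the result is empty, the text is probably not in the initial HTML and is fetched by a separate request, which would have to be reproduced with `requests`. If a tag is reported, the text is already in the response and can likely be extracted without a browser.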
【Comments】:
Tags: python python-3.x web-scraping beautifulsoup