[Posted]: 2021-08-29 10:17:00
[Question]:
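I am trying to scrape some real-estate articles from the following website:
I managed to get the links I need, but I am struggling with the pagination on the site. I am trying to scrape every link under each category, such as "building relationships", "building your team", "capital raising", and so on. Some of these category pages are paginated and some are not. I tried the following code, but it only gives me the links from page 2.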
from requests_html import HTMLSession


def tag_words_links(url):
    # Collect the name and URL of every tag in the page's tag cloud.
    global _session
    _request = _session.get(url)
    tags = _request.html.find('a.tag-cloud-link')
    links = []
    for link in tags:
        links.append({
            'Tags': link.find('a', first=True).text,
            'Links': link.find('a', first=True).attrs['href']
        })
    return links


def parse_tag_links(link):
    # Collect the article links for a single tag page.
    global _session
    _request = _session.get(link)
    articles = []
    try:
        # If the page has a rel="next" link, fetch that page and read its links.
        next_page = _request.html.find('link[rel="next"]', first=True).attrs['href']
        _request = _session.get(next_page)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])
    except:
        # No rel="next" link was found: read the links from the tag page itself.
        _request = _session.get(link)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])
    return articles


if __name__ == '__main__':
    _session = HTMLSession()
    url = 'https://lifebridgecapital.com/podcast/'
    links = tag_words_links(url)
    print(parse_tag_links('https://lifebridgecapital.com/tag/multifamily/'))
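For reference, the behaviour I am after could be sketched roughly as follows: keep reading the current page, then follow its `rel="next"` link until there is none. This is only a sketch under assumptions. The names `collect_paginated_links` and `make_fetcher` are hypothetical, the selectors `h3 a` and `link[rel="next"]` are taken from my code above, and I am assuming every paginated page exposes such a `rel="next"` link. The page fetcher is passed in as a function so that the loop itself does not depend on the network.

```python
from typing import Callable, List, Optional, Tuple

# A page fetcher maps a URL to (article links on that page, next page URL or None).
PageFetcher = Callable[[str], Tuple[List[str], Optional[str]]]


def collect_paginated_links(start_url: str,
                            fetch_page: PageFetcher,
                            max_pages: int = 100) -> List[str]:
    """Gather links from start_url and every page reachable via 'next' links."""
    articles: List[str] = []
    seen = set()  # URLs already visited, to guard against pagination loops
    url: Optional[str] = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        links, next_url = fetch_page(url)
        articles.extend(links)  # keep the current page's links before moving on
        url = next_url          # None ends the loop on the last page
    return articles


def make_fetcher(session) -> PageFetcher:
    """Build a fetcher from a requests_html HTMLSession (hypothetical helper)."""
    def fetch(url: str) -> Tuple[List[str], Optional[str]]:
        response = session.get(url)
        # Selector assumed from the question: article titles are 'h3 a' anchors.
        links = [a.attrs['href'] for a in response.html.find('h3 a')
                 if 'href' in a.attrs]
        # Assumption: paginated pages carry <link rel="next" href="...">.
        next_link = response.html.find('link[rel="next"]', first=True)
        next_url = next_link.attrs.get('href') if next_link is not None else None
        return links, next_url
    return fetch
```

With the session from my script, this would be called as `collect_paginated_links('https://lifebridgecapital.com/tag/multifamily/', make_fetcher(_session))`. Pages without pagination simply return `None` as the next URL, so the same call should cover both kinds of category page.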
[Comments]:
Tags: python python-3.x web-scraping pagination python-requests-html