Recursive Web Scraping Pagination: Answers

【Question title】: Recursive Web Scraping Pagination
【Posted】: 2021-08-29 10:17:00
【Question】:

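I am trying to scrape some real estate articles from the following website:

I managed to get the links I need, but I am struggling with the site's pagination. I am trying to scrape every link under each category, such as "building relationships", "building your team", "capital raising", and so on. Some of these category pages are paginated and some are not. I tried the following code, but it only gives me the links from 2 pages.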

from requests_html import HTMLSession


def tag_words_links(url):
    global _session
    _request = _session.get(url)
    tags = _request.html.find('a.tag-cloud-link')
    links = []
    for link in tags:
        links.append({
            'Tags': link.find('a', first=True).text,
            'Links': link.find('a', first=True).attrs['href']
        })

    return links

def parse_tag_links(link):
    global _session
    _request = _session.get(link)
    articles = []
    try:
        next_page = _request.html.find('link[rel="next"]', first=True).attrs['href']
        _request = _session.get(next_page)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])

    except:
        _request = _session.get(link)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])

    return articles


if __name__ == '__main__':
    _session = HTMLSession()
    url = 'https://lifebridgecapital.com/podcast/'
    links = tag_words_links(url)
    print(parse_tag_links('https://lifebridgecapital.com/tag/multifamily/'))

【Comments on the question】:

Tags: python python-3.x web-scraping pagination python-requests-html

【Solution 1】:

To print the title of every article under each tag, across every page of that tag, you can use the following example:

import requests
from bs4 import BeautifulSoup


url = "https://lifebridgecapital.com/podcast/"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
tag_links = [a["href"] for a in soup.select(".tagcloud a")]

for link in tag_links:
    while True:
        print(link)
        print("-" * 80)

        soup = BeautifulSoup(requests.get(link).content, "html.parser")

        for title in soup.select("h3 a"):
            print(title.text)

        print()

        next_link = soup.select_one("a.next")
        if not next_link:
            break

        link = next_link["href"]

Prints:

...

https://lifebridgecapital.com/tag/multifamily/
--------------------------------------------------------------------------------
WS890: Successful Asset Classes In The Current Market with Jerome Maldonado
WS889: How To Avoid A $1,000,000 Mistake with Hugh Odom
WS888: Value-Based On BRRRR VS Cap Rate with John Stoeber
WS887: Slow And Steady Still Wins The Race with Nicole Pendergrass
WS287: Increase Your NOI by Converting Units to Short Term Rentals with Michael Sjogren
WS271: Investment Strategies To Survive An Economic Downturn with Vinney Chopra
WS270: Owning a Construction Company Creates More Value with Abraham Ng’hwani
WS269: The Impacts of Your First Deal with Kyle Mitchell
WS260: Structuring Deals To Get The Best Return On Investment with Jeff Greenberg
WS259: Capital Raising For Newbies with Bryan Taylor

https://lifebridgecapital.com/tag/multifamily/page/2/
--------------------------------------------------------------------------------
WS257: Why Ground Up Development is the Best Investment with Sam Bates
WS256: Mobile Home Park Investing: The Real Deal with Jefferson Lilly
WS249: Managing Real Estate Paperwork Successfully with Krista Testani
WS245: Multifamily Syndication with Venkat Avasarala
WS244: Passive Investing In Real Estate with Kay Kay Singh
WS243: Getting Started In Real Estate Brokerage with Tyler Chesser
WS213: Data Analytics In Real Estate with Raj Tekchandani
WS202: Ben Leybovich and Sam Grooms on The Advantages Of A Partnership In Real Estate Business
WS199: Financial Freedom Through Real Estate Investing with Rodney Miller
WS197: Loan Qualifications: How The Whole Process Works with Vinney Chopra

https://lifebridgecapital.com/tag/multifamily/page/3/
--------------------------------------------------------------------------------
WS172: Real Estate Syndication with Kyle Jones

...
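
The question asked for article links rather than printed titles, so the same "follow the next link until there is none" loop can be wrapped in a generator. The following is only a sketch under assumptions: `load_page` and `iter_article_links` are hypothetical names, the selectors `h3 a` and `a.next` are taken from the answer above, and the `seen` set is an added guard against a page that links back to itself.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page loader maps a URL to (article links on that page, next-page URL or None).
PageLoader = Callable[[str], Tuple[List[str], Optional[str]]]


def load_page(url: str) -> Tuple[List[str], Optional[str]]:
    """Fetch one listing page using the selectors from the answer above."""
    # Imported here so the pagination loop can be exercised without network libraries.
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    links = [a["href"] for a in soup.select("h3 a[href]")]
    next_link = soup.select_one("a.next")
    return links, (next_link["href"] if next_link else None)


def iter_article_links(
    start_url: str, loader: PageLoader = load_page
) -> Iterator[Tuple[str, str]]:
    """Yield (page_url, article_url) for every page reachable from start_url."""
    seen = set()
    url: Optional[str] = start_url
    # Stop when there is no next page, or when a page has already been visited.
    while url and url not in seen:
        seen.add(url)
        links, next_url = loader(url)
        for link in links:
            yield url, link
        url = next_url
```

Calling `list(iter_article_links("https://lifebridgecapital.com/tag/multifamily/"))` would then collect the links from every page of that tag, whether or not the tag is paginated. Passing a different `loader` makes it possible to try the loop against canned pages without touching the network.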

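【Comments】:

I ran into a problem when I tried to write the data (i.e. the links, titles, and tag names) to a CSV file. There are actually 84 tag names, but 575 titles and 575 links, and when I try to index them I get a "list index out of range" error. Any idea how to handle this?
@Abbas To avoid cluttering the comment section, I suggest opening a new question on StackOverflow. I'll try to take a look at it...
Unfortunately, StackOverflow has banned me from asking any more questions.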
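
The asker's CSV code is not shown, so the following is a guess at the cause: three separate lists (84 tags, 575 titles, 575 links) were probably indexed with the same position, which runs past the end of the shorter list. One way around it, sketched below, is to build one row per article while the tag name is still in scope, so the tag is repeated on each row and nothing is looked up by index. The names `build_rows` and `write_rows` and the sample data are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO, Tuple


def build_rows(
    articles_by_tag: Dict[str, Iterable[Tuple[str, str]]]
) -> List[Dict[str, str]]:
    """Flatten {tag: [(title, link), ...]} into one row per article."""
    rows = []
    for tag, articles in articles_by_tag.items():
        for title, link in articles:
            # The tag is repeated on every row, so no list is indexed by position.
            rows.append({"Tag": tag, "Title": title, "Link": link})
    return rows


def write_rows(rows: List[Dict[str, str]], out: TextIO) -> None:
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(out, fieldnames=["Tag", "Title", "Link"])
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample data standing in for the scraped results.
sample = {
    "multifamily": [
        ("Episode A", "https://example.com/a"),
        ("Episode B", "https://example.com/b"),
    ],
    "syndication": [("Episode C", "https://example.com/c")],
}
buffer = io.StringIO()
write_rows(build_rows(sample), buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, the same `write_rows` call can be given a file opened with `open("articles.csv", "w", newline="", encoding="utf-8")`.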