【Question Title】: Recursive Web Scraping Pagination
【Posted】: 2021-08-29 10:17:00
【Question Description】:

I am trying to scrape some real estate articles from the following website:

Link

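I managed to get the links I need, but I am struggling with the pagination on the site. I am trying to scrape every link under each category, such as "building relationships", "building your team", "capital rising", and so on. Some of these category pages are paginated and some are not. I tried the following code, but it only gives me links from 2 pages.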

from requests_html import HTMLSession


def tag_words_links(url):
    """Return the name and URL of every tag in the tag cloud at `url`."""
    global _session
    _request = _session.get(url)
    tags = _request.html.find('a.tag-cloud-link')
    links = []
    for link in tags:
        # Each match is already an <a> element, so read it directly.
        links.append({
            'Tags': link.text,
            'Links': link.attrs['href']
        })

    return links


def parse_tag_links(link):
    """Return article links for a tag page (the attempt that misses pages)."""
    global _session
    _request = _session.get(link)
    articles = []
    try:
        # Raises AttributeError when the page has no <link rel="next">.
        next_page = _request.html.find('link[rel="next"]', first=True).attrs['href']
        # Only the next page is fetched here: the first page's articles are
        # skipped and any pages after the second are never visited.
        _request = _session.get(next_page)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.attrs['href'])

    except AttributeError:
        # No pagination: scrape the single page.
        _request = _session.get(link)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.attrs['href'])

    return articles


if __name__ == '__main__':
    _session = HTMLSession()
    url = 'https://lifebridgecapital.com/podcast/'
    links = tag_words_links(url)
    print(parse_tag_links('https://lifebridgecapital.com/tag/multifamily/'))

【Question Discussion】:

    Tags: python python-3.x web-scraping pagination python-requests-html


    【Solution 1】:

    To print the title of every article under every tag, across every page of each tag, you can use the following example:

    import requests
    from bs4 import BeautifulSoup


    url = "https://lifebridgecapital.com/podcast/"

    # Collect the URL of every tag from the tag cloud on the podcast page.
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    tag_links = [a["href"] for a in soup.select(".tagcloud a")]

    for link in tag_links:
        # Walk the tag's pages until there is no "next" link.
        while True:
            print(link)
            print("-" * 80)

            soup = BeautifulSoup(requests.get(link).content, "html.parser")

            # Article titles are the links inside <h3> headings.
            for title in soup.select("h3 a"):
                print(title.text)

            print()

            # Stop at the last page; otherwise follow the "next" link.
            next_link = soup.select_one("a.next")
            if not next_link:
                break

            link = next_link["href"]
    

    Prints:

    ...
    
    https://lifebridgecapital.com/tag/multifamily/
    --------------------------------------------------------------------------------
    WS890: Successful Asset Classes In The Current Market with Jerome Maldonado
    WS889: How To Avoid A $1,000,000 Mistake with Hugh Odom
    WS888: Value-Based On BRRRR VS Cap Rate with John Stoeber
    WS887: Slow And Steady Still Wins The Race with Nicole Pendergrass
    WS287: Increase Your NOI by Converting Units to Short Term Rentals with Michael Sjogren
    WS271: Investment Strategies To Survive An Economic Downturn with Vinney Chopra
    WS270: Owning a Construction Company Creates More Value with Abraham Ng’hwani
    WS269: The Impacts of Your First Deal with Kyle Mitchell
    WS260: Structuring Deals To Get The Best Return On Investment with Jeff Greenberg
    WS259: Capital Raising For Newbies with Bryan Taylor
    
    https://lifebridgecapital.com/tag/multifamily/page/2/
    --------------------------------------------------------------------------------
    WS257: Why Ground Up Development is the Best Investment with Sam Bates
    WS256: Mobile Home Park Investing: The Real Deal with Jefferson Lilly
    WS249: Managing Real Estate Paperwork Successfully with Krista Testani
    WS245: Multifamily Syndication with Venkat Avasarala
    WS244: Passive Investing In Real Estate with Kay Kay Singh
    WS243: Getting Started In Real Estate Brokerage with Tyler Chesser
    WS213: Data Analytics In Real Estate with Raj Tekchandani
    WS202: Ben Leybovich and Sam Grooms on The Advantages Of A Partnership In Real Estate Business
    WS199: Financial Freedom Through Real Estate Investing with Rodney Miller
    WS197: Loan Qualifications: How The Whole Process Works with Vinney Chopra
    
    https://lifebridgecapital.com/tag/multifamily/page/3/
    --------------------------------------------------------------------------------
    WS172: Real Estate Syndication with Kyle Jones
    
    ...
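
The loop in the answer prints as it goes. If the results need to be collected instead, the same "follow the next link until there is none" idea can be separated from the fetching. What follows is only a sketch: `iter_paginated`, `fetch_tag_page`, and the `Page` alias are hypothetical names that do not come from the answer, and `fetch_tag_page` simply reuses the answer's `h3 a` and `a.next` selectors, which are assumed to still match the site.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page is (items found on that page, URL of the next page or None).
Page = Tuple[List[str], Optional[str]]


def iter_paginated(
    start_url: str, fetch_page: Callable[[str], Page]
) -> Iterator[Tuple[str, str]]:
    """Yield (page_url, item) pairs, following next links until none is left."""
    seen = set()
    url: Optional[str] = start_url
    while url and url not in seen:
        seen.add(url)  # guard against a next link that points back to a visited page
        items, next_url = fetch_page(url)
        for item in items:
            yield url, item
        url = next_url


def fetch_tag_page(url: str) -> Page:
    """Hypothetical fetcher built on the selectors used in the answer."""
    # Imported here so the pagination logic above has no third-party dependency.
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    titles = [a.text for a in soup.select("h3 a")]
    next_link = soup.select_one("a.next")
    return titles, (next_link["href"] if next_link else None)
```

Calling `list(iter_paginated("https://lifebridgecapital.com/tag/multifamily/", fetch_tag_page))` would then gather every title for that tag together with the page it came from. Passing the fetcher in as an argument also makes it possible to exercise the loop against canned pages without touching the network.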
    

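    【Discussion】:

    • I ran into a problem when trying to write the data (i.e. links, titles and tag names) to a CSV. There are actually 84 tag names and titles, while the links are 575 each, and when I try to index them I get a "list index out of range" error. Any idea how to handle this?
    • @Abbas To avoid cluttering the comments section, I suggest opening a new question on StackOverflow. I'll try to take a look at it...
    • Unfortunately StackOverflow has banned me from asking more questions.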
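
The CSV problem raised in the comments was never answered in this thread. One common cause of "list index out of range" in this situation is indexing several separate lists of different lengths with the same counter. A possible way around it, sketched here under that assumption, is to build one record per article that already carries its tag, title, and link, so no cross-list indexing is needed. The function names, the column names, and the shape of `articles_by_tag` are all hypothetical.

```python
import csv
from typing import Dict, List, TextIO, Tuple


def flatten_rows(articles_by_tag: Dict[str, List[Tuple[str, str]]]) -> List[Dict[str, str]]:
    """Turn {tag: [(title, link), ...]} into one row per article."""
    rows = []
    for tag, articles in articles_by_tag.items():
        for title, link in articles:
            # The tag is repeated on every row, so rows never get out of step.
            rows.append({"tag": tag, "title": title, "link": link})
    return rows


def write_rows(rows: List[Dict[str, str]], out: TextIO) -> None:
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(out, fieldnames=["tag", "title", "link"])
    writer.writeheader()
    writer.writerows(rows)
```

To write a file, open it with `newline=""` as the `csv` module recommends, for example `with open("articles.csv", "w", newline="", encoding="utf-8") as f: write_rows(rows, f)`; the file name is a placeholder.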