How to scrape the next pages in Python using BeautifulSoup

【Question title】: How to scrape the next pages in python using Beautifulsoup
【Posted】: 2016-07-01 04:03:49
【Question description】:

Suppose I am scraping this URL:

http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha

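It contains a number of pages holding the data I want to scrape. How can I scrape the data from all of the following pages? I am using Python 3.5.1 and BeautifulSoup. Note: I can't use Scrapy or lxml, because they give me installation errors.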

【Question comments】:

Tags: python html web-scraping beautifulsoup html-parsing

【Solution 1】:

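Determine the last page by extracting the page parameter from the "go to the last page" element, then loop over every page while maintaining a web-scraping session via requests.Session():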

import re

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    # extract the last page
    response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha")    
    soup = BeautifulSoup(response.content, "html.parser")
    # raw string so that \d reaches the regex engine unchanged
    last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1))

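    # loop over every page (assumes the page parameter is zero-based,
    # so the index taken from the "last page" link is itself included)
    for page in range(last_page + 1):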
        response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=%d" % page)
        soup = BeautifulSoup(response.content, "html.parser")

        # print the title of every search result
        for result in soup.select("li.search-result"):
            title = result.find("div", class_="title").get_text(strip=True)
            print(title)

Prints:

A C S College of Engineering, Bangalore
A1 Global Institute of Engineering and Technology, Prakasam
AAA College of Engineering and Technology, Thiruthangal
...
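
As a side note, the page number can also be read from the link's query string with the standard library instead of a regular expression. The following is only a minimal sketch: the helper names `page_from_href` and `page_url` are made up for illustration, the sample `href` is hypothetical (shaped like the "last page" link selected above), and it assumes the `page` parameter is zero-based.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def page_from_href(href, default=0):
    """Return the integer `page` query parameter of href, or default if absent."""
    values = parse_qs(urlparse(href).query).get("page")
    return int(values[0]) if values else default


def page_url(base_url, page, **params):
    """Build base_url with the extra query params plus the given page index."""
    query = dict(params, page=page)
    return base_url + "?" + urlencode(query)


# Hypothetical href, as it might appear on the "last page" pager link.
href = "/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=12"
last_page = page_from_href(href)

base = "http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India"
# Assumption: page indices start at 0, so the last index is included.
urls = [page_url(base, page, sort_filter="alpha") for page in range(last_page + 1)]
print(last_page, len(urls))
print(urls[-1])
```

Each entry in `urls` could then be passed to `session.get(...)` inside the loop. Parsing the query string this way keeps working if the order of the parameters in the link changes.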

【Comments】:

Thank you, I learned a lot from you.
Hi alecxe, many thanks for the great idea and the example. I ran it in ATOM on MX Linux and got this annoying error: ` Traceback (most recent call last): File "/tmp/atom_script_tempfiles/bb9dd230-6d13-11ea-905d-13b9ee9fe090", line 9, in engineering.careers NameError: name 'engineering' is not defined [Finished in 1.333s]` Any idea what is going on here?