Python web scraping user list: answers

【Question title】: Python web scraping userlist
【Posted】: 2018-01-09 07:56:25
【Question description】:

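I am trying to scrape a list of users from a website, but the list spans multiple pages. I can scrape the first page, but I am stuck on scraping every page.

Code: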

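from bs4 import BeautifulSoup
import requests

# requests needs a full URL including the scheme
source = requests.get('http://example.com/users.php?page=1').text

soup = BeautifulSoup(source, 'lxml')

# each profile name is a link inside an <h3> inside an <li>
for profile in soup.select("li h3 a"):
    print(profile.text)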

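Notice the following in the URL:

page=1

The next page is

page=2

and so on. So my question is: how do I get Python to scrape the first page, then the second, and so on? It would be more efficient if I could give it a page limit, such as

 1-1000

so it doesn't try to go past the last page and hit blank pages.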

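【Question comments】:

Can you scrape example.com/users.php?page=2 right now? If so, you can run a for loop over a range of pages.
Yes, I can :), but the pagination seems odd: it advances by 20 each time, so page 1 = 0, page 2 = 20, page 3 = 40, page 4 = 60, and so on.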

Tags: python python-requests

【Solution 1】:

from bs4 import BeautifulSoup
import requests

no_of_user_to_scrape = 20  # despite the name, this is the number of pages to scrape
for page_no in range(1, no_of_user_to_scrape + 1):  # iterate over pages 1..20
    # params builds a URL like http://example.com/users.php?page=<page_no>
    response = requests.get("http://example.com/users.php", params={"page": page_no})
    soup = BeautifulSoup(response.text, 'lxml')

    for profile in soup.select("li h3 a"):
        print(profile.text)

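【Comments】:

Thanks for your reply. I accepted Vaseem's answer because it was the one I followed first (the first one I saw), but for anyone who reads this later with the same problem, this answer works just as well as the first. Thanks for your time :)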

【Solution 2】:

Try it like this:

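from bs4 import BeautifulSoup
import requests

page_size = 0  # offset passed as the page parameter: 0, 20, 40, ...
for page_no in range(1, 1000):
    # requests needs a full URL including the scheme
    source = requests.get('http://example.com/users.php?page={}'.format(page_size)).text
    page_size += 20  # the site advances by 20 per page
    soup = BeautifulSoup(source, 'lxml')
    for profile in soup.select("li h3 a"):
        print(profile.text)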
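
The `page_size += 20` counter above follows the pattern reported in the question comments (page 1 = 0, page 2 = 20, page 3 = 40, ...). As a minimal sketch, the same arithmetic can be written as a small helper so the page limit is stated in page numbers rather than offsets. The names `PAGE_STEP`, `page_to_offset`, and `page_offsets` are hypothetical, and the step of 20 is an assumption taken from that comment, not something confirmed about the site.

```python
PAGE_STEP = 20  # assumed step between pages, as reported in the question comments


def page_to_offset(page_no, step=PAGE_STEP):
    """Map a 1-based page number to the value sent as the page parameter."""
    if page_no < 1:
        raise ValueError("page numbers start at 1")
    return (page_no - 1) * step


def page_offsets(first_page, last_page, step=PAGE_STEP):
    """Return the offsets for first_page..last_page inclusive, as a range."""
    start = page_to_offset(first_page, step)
    stop = page_to_offset(last_page, step) + 1  # +1 so the last page is included
    return range(start, stop, step)
```

Looping over `page_offsets(1, 1000)` would then visit the same kind of offsets as the counter in the answer, with the limit expressed as "pages 1 to 1000".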

【Comments】:

【Solution 3】:

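If your scraper also works on example.com/users.php?page=2, you can simply use an additional for loop to iterate over those pages. You will need some way to check whether a page has any entries, so you can handle the condition for when the loop should end.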
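
One way to sketch that stopping condition is to keep requesting pages until one comes back with no entries, with an upper bound as a safety net. This is only an illustration under assumptions: the function names `scrape_until_empty` and `fetch_names` are hypothetical, the step of 20 comes from the question comments, and it assumes a page past the end returns no `li h3 a` matches rather than an error.

```python
from typing import Callable, Iterator, List

PAGE_STEP = 20  # assumed step between pages, as reported in the question comments


def scrape_until_empty(fetch_names: Callable[[int], List[str]],
                       max_pages: int = 1000,
                       step: int = PAGE_STEP) -> Iterator[str]:
    """Yield names page by page, stopping at the first page with no entries."""
    for page_index in range(max_pages):
        names = fetch_names(page_index * step)
        if not names:  # an empty page is taken to mean we ran past the end
            break
        yield from names


def fetch_names(offset: int) -> List[str]:
    """Download one listing page and return the profile names found on it."""
    # Imported here so the loop above can be tried without these packages installed.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("http://example.com/users.php",
                            params={"page": offset}, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "lxml")
    return [link.text for link in soup.select("li h3 a")]
```

With this in place, `for name in scrape_until_empty(fetch_names): print(name)` would print every name until the first empty page or until 1000 pages have been fetched, whichever comes first. Passing the fetch function in as an argument keeps the loop logic separate from the network call, so the loop can be checked with a stand-in function.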

【Comments】: