【Question title】: Looping through scraping multiple URLs using Python but data isn't changing when I iterate through site page numbers?
【Posted】: 2017-08-16 14:02:41
【Question】:

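I'm using requests and BeautifulSoup for web scraping, and I get some strange results when I try to loop through multiple pages of message board data by adding 1 to the page number on each iteration.

The code below is an example where I loop through page 1 and then page 2 of the message board. To check myself, I print the URL I am hitting and then the first record found on that page. The URLs look correct, but the first post is the same for both. Yet if I copy and paste those two URLs into a browser, I definitely see a different set of content on each page.

Can anyone tell me whether this is a problem with my code, or whether it has to do with how the data is structured on the forum that is giving me these results? Thanks in advance!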

from bs4 import BeautifulSoup
import requests

n_pages = 2
base_link = 'http://tigerboard.com/boards/list.php?board=4&page='

for i in range(1, n_pages + 1):
    link = base_link + str(i)
    html_doc = requests.get(link)
    soup = BeautifulSoup(html_doc.text, "lxml")
    # Collect every <div class="msgline"> element (one per post)
    bs_tags = soup.find_all("div", {"class": "msgline"})
    posts = []
    for post in bs_tags:
        posts.append(post.text)
    # Print the URL requested and the first post found on that page
    print(link)
    print(posts[0])

>     http://tigerboard.com/boards/list.php?board=4&page=1
>     52% of all websites are in English, but  - catbirdseat MU - 3/23/17 14:41:06
>     http://tigerboard.com/boards/list.php?board=4&page=2
>     52% of all websites are in English, but  - catbirdseat MU - 3/23/17 14:41:06

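【Comments】:

posts is the list of posts on the current page, not a cumulative list of all posts seen so far.
@JohnGordon Yes, in my original code I defined the "posts" list outside the main loop to keep a running record of everything, but for troubleshooting I moved it inside so that it is reset after each page.
Using range(2, n_pages + 1) also returns results from the first page only. I tried various ways to avoid redirects, such as requests.get(link, allow_redirects=False), as well as what has already been discussed, e.g. here and here, but without success so far.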

Tags: python beautifulsoup python-requests

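【Solution 1】:

The website's implementation is bogus. For some reason it requires a specific cookie, PHPSESSID, to be set; otherwise it returns only the first page, regardless of the page parameter.

Setting this cookie solves the problem: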

from bs4 import BeautifulSoup
import requests

n_pages = 2
base_link = 'http://tigerboard.com/boards/list.php?board=4&page='

for i in range(1, n_pages + 1):
    link = base_link + str(i)
    # Send a PHPSESSID cookie with every request; its value does not matter
    html_doc = requests.get(link, headers={'Cookie': 'PHPSESSID=notimportant'})
    soup = BeautifulSoup(html_doc.text, "lxml")
    # Collect every <div class="msgline"> element (one per post)
    bs_tags = soup.find_all("div", {"class": "msgline"})
    posts = []
    for post in bs_tags:
        posts.append(post.text)
    # Print the URL requested and the first post found on that page
    print(link)
    print(posts[0])

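Another solution is to use a session: the first request (for the first page) will set the cookie to a real value, and that cookie will then be sent with all later requests.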
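A minimal sketch of the session approach, assuming the site behaves as described above. The helper names `first_post_per_page` and `main` are made up for illustration, and `html.parser` is swapped in for `lxml` only to avoid an extra dependency. `requests.Session` keeps cookies between requests, so whatever PHPSESSID the server sets on the first response is sent back automatically afterwards.

```python
from bs4 import BeautifulSoup
import requests

BASE_LINK = 'http://tigerboard.com/boards/list.php?board=4&page='


def first_post_per_page(session, base_link, n_pages):
    """Return a list of (url, first_post_text) tuples for pages 1..n_pages.

    `session` is expected to behave like requests.Session: cookies set by
    one response are sent with the following requests.
    """
    results = []
    for i in range(1, n_pages + 1):
        link = base_link + str(i)
        response = session.get(link, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        posts = [tag.text for tag in soup.find_all("div", {"class": "msgline"})]
        # Use None when a page has no posts instead of raising IndexError
        results.append((link, posts[0] if posts else None))
    return results


def main():
    # Hypothetical entry point: one Session shared by all page requests
    with requests.Session() as session:
        for link, first_post in first_post_per_page(session, BASE_LINK, 2):
            print(link)
            print(first_post)
```

Calling `main()` runs the loop against the board. One caveat, inferred from the description above rather than tested: the very first request of a session still carries no cookie, so a loop that starts at a page other than 1 would probably need one throwaway request first to pick up the cookie.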

That was fun to debug!

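【Comments】:

You, sir, are a gentleman and a scholar. Thank you! It works perfectly now.
A fun way to reproduce the problem in a browser is to open tigerboard.com/boards/list.php?board=4&page=2 in an incognito window; it will show the first page.
Well played, sir.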
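
The incognito-window check from the comments can also be approximated in code. The sketch below is only an illustration: `page_differs_with_cookie` is a made-up helper name, and it passes the cookie through the `cookies=` parameter of `requests.get` rather than the raw `Cookie` header used in the answer. It fetches the same URL once without and once with a PHPSESSID cookie and reports whether the response bodies differ.

```python
import requests


def page_differs_with_cookie(url, get=requests.get):
    """Fetch `url` without and with a PHPSESSID cookie and compare the bodies.

    `get` defaults to requests.get; it is a parameter only so that the check
    can be exercised without network access.
    """
    without_cookie = get(url, timeout=10)
    # The cookie value is arbitrary, as in the answer above
    with_cookie = get(url, cookies={"PHPSESSID": "notimportant"}, timeout=10)
    return without_cookie.text != with_cookie.text
```

A `True` result for a page-2 URL would be consistent with the diagnosis, but it is only suggestive: two fetches of the same page can also differ for unrelated reasons, such as timestamps or new posts.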