Scraping multiple pages with Python and BeautifulSoup: answers

【Question title】: Scraping multiple pages with Python and BeautifulSoup
【Posted】: 2020-02-02 10:13:51
【Question description】:

I am trying to scrape many pages with BeautifulSoup in Python, but without success.

I tried both requests.get() and session.get(). The number of pages I need to scrape is 92.

import requests
from bs4 import BeautifulSoup
import urllib.request
with requests.Session() as session:
    count = 0
    for i in range(92):
        count+=1
        page = "https://www.paginegialle.it/lazio/roma/dentisti/p-"+str(count)+".html"
        r = session.get(page)
        soup = BeautifulSoup(r.content)

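With print(page), the page URLs are formatted correctly. But when I execute soup to print all the values stored in the variable, only the values of the first page are printed. I am using a Jupyter notebook.

【Discussion】:

What do you mean by "execute soup to print all the values"? What code are you using to print those values? At the moment, your code simply overwrites the contents of the soup variable on every pass through the loop.
I mean that I just type soup and execute it (Shift+Enter).
Where and when do you do that? You have a loop, after all.
In a new cell, after executing the current code.
Then soup should always hold the content of the last page. I think your code is basically correct, but you need to do some processing of soup inside the loop.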
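
As the comments above point out, soup is replaced on every iteration, so after the loop only one page is left in it. Below is a minimal sketch of one way to keep something from every page: extract the data inside the loop and append it to a list. The names page_url, scrape_pages and fetch_html are hypothetical helpers introduced for illustration, and the page title is only a stand-in for whatever fields are actually needed.

```python
from bs4 import BeautifulSoup

BASE = "https://www.paginegialle.it/lazio/roma/dentisti/p-"


def page_url(n):
    """Build the URL of result page n (pages are numbered from 1)."""
    return BASE + str(n) + ".html"


def scrape_pages(fetch_html, first=1, last=92):
    """Collect one (page number, title) pair per page.

    fetch_html is a hypothetical hook: any callable that takes a URL
    and returns the page's HTML as a string.
    """
    results = []
    for n in range(first, last + 1):
        soup = BeautifulSoup(fetch_html(page_url(n)), "html.parser")
        # Extract what is needed here, before soup is overwritten
        # by the next iteration.
        title = soup.title.get_text(strip=True) if soup.title else None
        results.append((n, title))
    return results
```

With requests, the hook could be something like `lambda url: session.get(url, timeout=10).text` inside a `with requests.Session() as session:` block; the returned list then holds one entry per page rather than only the last one.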

Tags: python beautifulsoup

【Solution 1】:

You could do it like this:

import requests
from bs4 import BeautifulSoup

# Pages are numbered from 1 to 92, so start the range at 1.
for i in range(1, 93):
    url = "https://www.paginegialle.it/lazio/roma/dentisti/p-"+str(i)+".html"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Process each page inside the loop; soup is replaced on every iteration.
    p = soup.select('p')
    print(len(p))

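【Discussion】:

@Okenite - try the above again.
I re-edited my code, since it was basically correct, but I added the rest of the code to the same cell, inside the for loop. p = soup.select('p') was not useful for what I need. By the way, I retrieved only 177 of about 2,000 results, so I think there is something else I am not considering at the moment, possibly a timeout in the requests. I will analyze the problem further.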
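
Whether timeouts really explain the missing results is not confirmed in this thread. If they do, one way to make failures visible is to set an explicit timeout, check the HTTP status, and retry a few times. The sketch below assumes the failures are transient; fetch_with_retry and its retries and delay parameters are hypothetical names for illustration, not part of requests.

```python
import time

import requests


def fetch_with_retry(get, url, retries=3, delay=1.0, timeout=10):
    """Return the response body as text, or None if every attempt fails.

    get is any callable with the calling convention of requests.get
    or session.get (it is passed the URL and a timeout keyword).
    """
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=timeout)
            # Turn 4xx/5xx responses into an exception instead of
            # silently parsing an error page.
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            print(f"attempt {attempt} for {url} failed: {exc}")
            if attempt < retries:
                time.sleep(delay)  # brief pause before the next try
    return None
```

Called as `fetch_with_retry(session.get, page)`, a None result marks a page that never loaded, which makes it possible to tell skipped pages apart from pages that simply contain fewer entries.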

【Solution 2】:

This will work.

from bs4 import BeautifulSoup
import requests

# Page numbers run from 1 to 92.
for count in range(1, 93):
    source1 = requests.get("https://www.paginegialle.it/lazio/roma/dentisti/p-"+str(count)+".html").text

    # The 'lxml' parser requires the lxml package to be installed.
    soup1 = BeautifulSoup(source1, 'lxml')

    print(soup1.body)
    print()
print("done")

【Solution 3】:

Another solution.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
count = 0
for i in range(92):
    count+=1
    # Use count (1..92) in the URL; i would start at p-0 and never reach p-92.
    html = req.get('https://www.paginegialle.it/lazio/roma/dentisti/p-'+str(count)+'.html')
    doc = SimplifiedDoc(html)
    print(doc.select('title>text()'))
print (count)
