How to scrape a page without pagination with requests and beautifulsoup?

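【Question title】: How to scrape a page without pagination with requests and beautifulsoup?
【Posted】: 2021-05-24 22:18:50
【Question】:

I am scraping a web page (using Python requests and BeautifulSoup) and need to go through every page of a list of items. Moving to the next page requires clicking a "next" button, and so far my code returns only the first 50 rows.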

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'http://sistemas.anatel.gov.br/se/public/view/b/licenciamento'
antenas = requests.get(url)

if antenas.status_code == 200:
    print('Requisição bem sucedida!')
    content = antenas.content

soup = BeautifulSoup(content, 'html.parser')
table = soup.find_all(name='table')

table_str = str(table)
df = pd.read_html(table_str)[0]

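My goal is to scrape the entire table automatically, from all of the links!

【Comments】:

"Scrap" means to throw away or discard. The word you want is 'scrape'.

Tags: python beautifulsoup python-requests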

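【Solution 1】:

This page fetches the table with a separate AJAX request to http://sistemas.anatel.gov.br/se/public/view/b/lic_table.php, which you can see in the browser's developer tools (F12 -> Network). For pagination, it appears to pass a number in the skip form parameter.

Try fetching each page separately, like this: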

url = 'http://sistemas.anatel.gov.br/se/public/view/b/lic_table.php'

result_dfs = []

i = 0
while True:
  data = {
    'skip': i*50,
    'rpp': 50,
    'wfid': 'licencas'
  }
  r = requests.post(url, data=data)

  # process the results here...
  # df = ...

  # break when there are no more results
  if len(df.index) == 0:
    break

  result_dfs.append(df)

  i += 1

# put them all together
df = pd.concat(result_dfs)

There are some other form parameters as well; I'm not sure whether they are needed.
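To make the "process the results here" step in the loop above concrete, here is a minimal sketch of one way it could be filled in. It rests on assumptions that are not confirmed by the answer: that the endpoint returns HTML containing a `<table>`, and that a `skip` value past the last row yields an empty table or no table at all. The helper names (`build_payload`, `fetch_html`, `parse_table`, `scrape_all`), the 30-second timeout and the `max_pages` safety cap are all hypothetical choices made for illustration.

```python
from io import StringIO

# Endpoint quoted in the answer above.
URL = "http://sistemas.anatel.gov.br/se/public/view/b/lic_table.php"


def build_payload(page, rows_per_page=50):
    # Form fields and values are the ones shown in the answer; other fields
    # sent by the browser are left out (assumption: they are optional).
    return {"skip": page * rows_per_page, "rpp": rows_per_page, "wfid": "licencas"}


def fetch_html(page, url=URL):
    import requests  # imported lazily so the other helpers run without it

    response = requests.post(url, data=build_payload(page), timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def parse_table(html):
    import pandas as pd

    try:
        # StringIO is used because newer pandas versions deprecate passing
        # a literal HTML string to read_html.
        return pd.read_html(StringIO(html))[0]
    except ValueError:
        # read_html raises ValueError when it finds no <table>.
        return pd.DataFrame()


def scrape_all(fetch=fetch_html, parse=parse_table, max_pages=1000):
    # Collect one result per page and stop at the first empty page.
    frames = []
    for page in range(max_pages):  # cap guards against an endless loop
        df = parse(fetch(page))
        if len(df) == 0:
            break
        frames.append(df)
    return frames
```

With pandas installed, the pages could then be combined with `pd.concat(scrape_all(), ignore_index=True)`. Passing `fetch` and `parse` as arguments is only a convenience: it lets the stopping logic be checked with stand-in functions before any request is sent to the real site.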

【Comments】: