Python Web Scraping with search and non dynamic URI: answers

【Question title】: Python Web Scraping with search and non dynamic URI
【Posted】: 2019-01-24 03:00:13
【Question description】:

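I am a beginner in the world of Python and web scraping. I am used to building scrapers for sites with dynamic URLs, where the URI changes when I put specific parameters into the URL itself.
For example: Wikipedia.
(If I search for "Stack Overflow", I get a URI like this: https://en.wikipedia.org/wiki/Stack_Overflow)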

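Right now my challenge is to develop a web scraper that collects data from this page.

The "Texto/Termos a serem pesquisados" field is a search field, but when I enter a search the URL stays the same, so I cannot get the right HTML for my research.

I am used to scraping with BeautifulSoup and Requests, but they are no use in this case because the URL stays the same after the search.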

import requests
from bs4 import BeautifulSoup

url = 'http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp'
html = requests.get(url)
bsObj = BeautifulSoup(html.content, 'html.parser')

print(bsObj)
# And from now on I can't go any further

Normally I would do something like this:

url = 'https://en.wikipedia.org/wiki/'
query = input('Input your search: ')  # renamed so the built-in input() is not shadowed
search = url + query

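Then I do all the BeautifulSoup work, followed by findAll, to get my data out of the HTML.

I also tried Selenium, but because of all the webdriver setup I am looking for something different. With the following code I got some odd results, and I still cannot scrape the HTML properly.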

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup

# Access the page and type the search into the field

driver = webdriver.Chrome()
driver.get('http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp')
driver.switch_to.frame('main2')
busca = driver.find_element_by_id("txtTermo")
busca.send_keys("GESTAO DE PESSOAS")
#data_inicio = driver.find_element_by_id('dt_publ_ini')
#data_inicio.send_keys("01/01/2018")
#data_fim = driver.find_element_by_id('dt_publ_fim')
#data_fim.send_keys('20/12/2018')
botao = driver.find_element_by_id('ok')
botao.click()

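With all that in mind:

Is there a way to scrape data from these static URLs?
Can I enter a search into the field from code?
Why can't I scrape the correct source code?

【Question comments】:

Tags: python python-3.x web-scraping beautifulsoup python-requests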

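【Solution 1】:

The problem is that your initial search page uses frames for both the search and the results, which makes it harder for BeautifulSoup to work with. I was able to get search results by using a slightly different URL and MechanicalSoup: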

>>> from mechanicalsoup import StatefulBrowser
>>> sb = StatefulBrowser()
>>> sb.open('http://comprasnet.gov.br/ConsultaLicitacoes/ConsLicitacao_texto.asp')
<Response [200]>
>>> sb.select_form()  # select the search form
<mechanicalsoup.form.Form object at 0x7f2c10b1bc18>
>>> sb['txtTermo'] = 'search text'  # input the text to search for
>>> sb.submit_selected()  # submit the form
<Response [200]>
>>> page = sb.get_current_page()  # get the returned page in BeautifulSoup form
>>> type(page)
<class 'bs4.BeautifulSoup'>

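Note that the URL I use here is the URL of the frame that holds the search form, not the page you gave, which embeds it. This removes one layer of indirection.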
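
One way to find that inner URL yourself is to list the frames the outer page declares. The following is a minimal sketch using only the standard library. The frameset in SAMPLE is a hypothetical stand-in for the real markup (only the frame name main2 comes from the question's Selenium code), and FrameCollector and frame_urls are illustrative names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the name and src of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.frames = []  # list of (name, src) tuples, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            attrs = dict(attrs)
            if attrs.get("src"):
                self.frames.append((attrs.get("name"), attrs["src"]))


def frame_urls(html, base_url):
    """Return (frame name, absolute URL) pairs for the frames in html."""
    collector = FrameCollector()
    collector.feed(html)
    return [(name, urljoin(base_url, src)) for name, src in collector.frames]


# Hypothetical frameset standing in for the outer page's real markup.
SAMPLE = """
<frameset rows="80,*">
  <frame name="top" src="/topo.asp">
  <frame name="main2" src="/ConsultaLicitacoes/ConsLicitacao_texto.asp">
</frameset>
"""

for name, url in frame_urls(SAMPLE, "http://comprasnet.gov.br/acesso.asp"):
    print(name, url)
```

To use it against the live site, the HTML returned by requests.get(outer_url).text would be passed in place of SAMPLE. The real page may nest frames or build them with JavaScript, in which case this simple pass would not find them.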

MechanicalSoup is built on top of BeautifulSoup and provides tools for interacting with websites in a way similar to the old mechanize library.
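
For readers who prefer to stay with Requests, the same idea can be approximated by hand: read the form's action, method, and input fields, override txtTermo, and send the request yourself. The sketch below is an assumption-laden illustration, not what MechanicalSoup does internally. FormReader, build_request, and search are hypothetical names, and the action and the numpag field in SAMPLE_FORM are invented for the example (only txtTermo comes from the posts above). It also ignores select, textarea, checkbox, and radio handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormReader(HTMLParser):
    """Record the first <form>'s action, method, and named <input> values."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False  # set once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._in_form and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_request(html, page_url, **overrides):
    """Return (method, absolute action URL, payload) for the first form."""
    reader = FormReader()
    reader.feed(html)
    payload = {**reader.fields, **overrides}
    return reader.method, urljoin(page_url, reader.action), payload


def search(page_url, term):
    """Fetch the form page, fill in txtTermo, and submit it (needs network)."""
    import requests  # imported here so the parsing part runs without it

    with requests.Session() as session:
        html = session.get(page_url).text
        method, action, payload = build_request(html, page_url, txtTermo=term)
        if method == "post":
            return session.post(action, data=payload)
        return session.get(action, params=payload)


# Hypothetical form markup; the real page's fields may differ.
SAMPLE_FORM = """
<form action="ConsLicitacao_Relacao.asp" method="post">
  <input type="text" name="txtTermo" value="">
  <input type="hidden" name="numpag" value="1">
</form>
"""

FORM_PAGE = "http://comprasnet.gov.br/ConsultaLicitacoes/ConsLicitacao_texto.asp"
method, action, payload = build_request(
    SAMPLE_FORM, FORM_PAGE, txtTermo="GESTAO DE PESSOAS"
)
print(method, action, payload)
```

Calling search(FORM_PAGE, "GESTAO DE PESSOAS") would return a Response whose content can be handed to BeautifulSoup as usual. If the site relies on JavaScript or extra cookies to submit the form, this approach may not be enough, and MechanicalSoup or Selenium would remain the safer choice.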

【Comments】: